pyspark - How to Reduce Nested Dictionaries in Spark
In PySpark, I have an RDD that contains multiple dictionaries. Each of these dictionaries, in turn, contains multiple dictionaries. It looks like this:
label1 : {tag1 : count = 2, tag2 : count = 3}, {tag2 : count = 3}, {tag3 : count = 1}, ...
label2 : {tag1 : count = 2, tag3 : count = 2}, {tag2 : count = 5}, {tag4 : count = 3}, ...
...
Given this structure, I'd like to be able to "reduce" the dictionaries so that the result has the following form:
label1 : {tag1 : count = 2}, {tag2 : count = 6}, {tag3 : count = 1}, ...
label2 : {tag1 : count = 2}, {tag2 : count = 5}, {tag3 : count = 2}, {tag4 : count = 3}, ...
...
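For concreteness, one way this data might be represented in Python (an assumption, since the exact RDD element type isn't shown in the question) is as (label, list-of-dicts) pairs, where each inner dict maps a tag to its count:

    # Hypothetical representation of the RDD described above:
    # each element is a (label, [tag-count dict, ...]) pair.
    data = [
        ("label1", [{"tag1": 2, "tag2": 3}, {"tag2": 3}, {"tag3": 1}]),
        ("label2", [{"tag1": 2, "tag3": 2}, {"tag2": 5}, {"tag4": 3}]),
    ]
    rdd = sc.parallelize(data)  # sc is an existing SparkContext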
I have a feeling this resembles 'reduce' or 'combine' or 'groupBy', but I'm having difficulty finding the right function. Can someone please point me to a function in Spark that might accomplish this task? Thanks!
This should flatten an iterator of dictionaries into one big dictionary, summing the counts for tags that appear in more than one of them:
    def combine(dicts):
        # Merge an iterable of {tag: count} dictionaries into one,
        # summing the counts for tags that occur in several dicts.
        bigdict = dict()
        for littledict in dicts:
            for key, value in littledict.items():
                bigdict[key] = bigdict.get(key, 0) + value
        return bigdict

    rdd.map(combine)  # if each RDD element is itself the iterable of dicts;
                      # for (label, dicts) pairs, use rdd.mapValues(combine) instead
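Since the question mentions 'reduce', here is an alternative sketch (not the answer above, and assuming the hypothetical (label, [dict, ...]) layout shown earlier): flatten each element into ((label, tag), count) records and let reduceByKey do the summing, then regroup by label.

    # Emit one ((label, tag), count) pair per tag occurrence.
    pairs = rdd.flatMap(
        lambda kv: [((kv[0], tag), count)
                    for d in kv[1]
                    for tag, count in d.items()])
    # Sum counts per (label, tag), then fold the tags back into one dict per label.
    summed = pairs.reduceByKey(lambda a, b: a + b)
    result = (summed
              .map(lambda kv: (kv[0][0], {kv[0][1]: kv[1]}))
              .reduceByKey(lambda d1, d2: dict(d1, **d2)))
    # result.collect() -> e.g. [("label1", {"tag1": 2, "tag2": 6, "tag3": 1}), ...]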