apache spark - How to get the difference between two RDDs in PySpark?
I'm trying to establish a cohort study to track in-app user behavior, and I want to ask if you have any idea how I can exclude the elements of rdd2 from rdd1. Given:
rdd1 = sc.parallelize([("a", "xoxo"), ("b", 4)])
rdd2 = sc.parallelize([("a", (2, "6play")), ("c", "bobo")])
For example, to get the common elements between rdd1 and rdd2, I have:
rdd1.join(rdd2).map(lambda kv: (kv[0], kv[1][1])).collect()  # tuple-unpacking lambdas are Python 2 only
which gives:
[('a', (2, '6play'))]
So, the join finds the common elements between rdd1 and rdd2 and takes the key and values from rdd2 only. I want the opposite: to find the elements that are in rdd2 but not in rdd1, and take the key and values from rdd2 only. In other words, I want the items of rdd2 that aren't present in rdd1. The expected output is:
("c", "bobo")
Any ideas? Thanks :)
I got the answer, and it's simple!
rdd2.subtractByKey(rdd1).collect()
Enjoy :)
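
For reference, here is a minimal self-contained sketch of the whole thing (assuming a local SparkContext; the app name is just for this demo):

from pyspark import SparkContext

sc = SparkContext("local", "subtract-by-key-demo")  # hypothetical local context for this sketch

rdd1 = sc.parallelize([("a", "xoxo"), ("b", 4)])
rdd2 = sc.parallelize([("a", (2, "6play")), ("c", "bobo")])

# subtractByKey keeps the pairs of rdd2 whose key does NOT appear in rdd1,
# which is exactly the anti-join the question asks for
print(rdd2.subtractByKey(rdd1).collect())  # [('c', 'bobo')]

sc.stop()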