apache spark - How to get the difference between two RDDs in PySpark?


I'm trying to establish a cohort study to track in-app user behavior, and I want to ask if you have any idea how I can exclude the elements of rdd1 from rdd2. Given:

rdd1 = sc.parallelize([("a", "xoxo"), ("b", 4)])
rdd2 = sc.parallelize([("a", (2, "6play")), ("c", "bobo")])
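
For anyone reproducing this locally, here is a minimal runnable setup; the SparkContext creation below is an assumption, since the question doesn't show how sc was built:

from pyspark import SparkContext

# Assumed local SparkContext for testing; adjust master/app name as needed
sc = SparkContext("local[*]", "rdd-difference-demo")

# Pair RDDs: rdd1 has keys "a" and "b", rdd2 has keys "a" and "c"
rdd1 = sc.parallelize([("a", "xoxo"), ("b", 4)])
rdd2 = sc.parallelize([("a", (2, "6play")), ("c", "bobo")])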

For example, to get the common elements between rdd1 and rdd2, I have:

rdd1.join(rdd2).map(lambda kv: (kv[0], kv[1][1])).collect()

which gives:

[('a', (2, '6play'))] 

So the join finds the common elements between rdd1 and rdd2 and takes the key and values from rdd2 only. I want the opposite: find the elements that are in rdd2 but not in rdd1, and take the key and values from rdd2 only. In other words, I want the items of rdd2 that aren't present in rdd1. Expected output:

("c", "bobo") 

Any ideas? Thanks :)

I got the answer, and it's simple!

rdd2.subtractByKey(rdd1).collect()

Enjoy :)
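
For completeness, a minimal end-to-end sketch, using the rdd1/rdd2 setup from above (the SparkContext setup is assumed). Note that subtractByKey compares keys only, ignoring values: it keeps every pair of rdd2 whose key does not appear in rdd1.

# Keep pairs of rdd2 whose key does NOT occur in rdd1
result = rdd2.subtractByKey(rdd1).collect()
print(result)  # [('c', 'bobo')]

If you want the same result without subtractByKey, a leftOuterJoin plus a filter works too. This is an alternative sketch, not from the original answer, and it assumes rdd1 never stores None as a value:

# leftOuterJoin yields (key, (v2, v1)) with v1 = None when the key is absent from rdd1
missing = (rdd2.leftOuterJoin(rdd1)
               .filter(lambda kv: kv[1][1] is None)
               .map(lambda kv: (kv[0], kv[1][0])))
print(missing.collect())  # [('c', 'bobo')]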

