apache spark - How to get the difference between two RDDs in PySpark?
I'm trying to establish a cohort study to track in-app user behavior, and I want to ask if you have any idea how I can exclude the elements of rdd2 from rdd1. Given:
rdd1 = sc.parallelize([("a", "xoxo"), ("b", 4)])
rdd2 = sc.parallelize([("a", (2, "6play")), ("c", "bobo")])
For example, to get the common elements between rdd1 and rdd2, I have:
rdd1.join(rdd2).map(lambda kv: (kv[0], kv[1][1])).collect()  # tuple-unpacking lambdas are Python 2 only
which gives:
[('a', (2, '6play'))]
So, the join finds the common elements between rdd1 and rdd2 and takes the key and values from rdd2 only. I want the opposite: to find the elements that are in rdd2 but not in rdd1, and take the key and values from rdd2 only. In other words, I want the items of rdd2 that aren't present in rdd1. The expected output is:
("c", "bobo")
Any ideas? Thanks :)
I got the answer, and it's simple!
rdd2.subtractByKey(rdd1).collect()
Enjoy :)
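
For reference, here is a minimal self-contained sketch of the whole thing (assuming a local SparkContext; the app name is just for this demo):

from pyspark import SparkContext

sc = SparkContext("local", "subtract-by-key-demo")  # hypothetical local context for this sketch

rdd1 = sc.parallelize([("a", "xoxo"), ("b", 4)])
rdd2 = sc.parallelize([("a", (2, "6play")), ("c", "bobo")])

# subtractByKey keeps the pairs of rdd2 whose key does NOT appear in rdd1,
# which is exactly the anti-join the question asks for
print(rdd2.subtractByKey(rdd1).collect())  # [('c', 'bobo')]

sc.stop()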