apache spark - How to load only the data of the last partition


I have some data partitioned this way:

/data/year=2016/month=9/version=0
/data/year=2016/month=10/version=0
/data/year=2016/month=10/version=1
/data/year=2016/month=10/version=2
/data/year=2016/month=10/version=3
/data/year=2016/month=11/version=0
/data/year=2016/month=11/version=1

When using this data, I'd like to load only the last version of each month.

A simple way to do this is to call load("/data/year=2016/month=11/version=3") instead of load("/data").
The drawback of this solution is the loss of partitioning information such as year and month, which means it is no longer possible to apply operations based on the year or the month.

Is it possible to ask Spark to load only the last version of each month? How would you go about this?

Well, Spark supports predicate push-down, so if you provide a filter following the load, it will only read in the data fulfilling the criteria in the filter. Like this:

spark.read.option("basePath", "/data").load("/data").filter('version === 3)

And you keep the partitioning information :)
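
Note that the filter above pins a single version value for every month, while in the listing the latest version differs per month (0, 3 and 1). One possible way to handle that, sketched here under assumptions rather than taken from the answer, is to pick the highest version directory for each (year, month) yourself and hand only those paths to Spark. The object name LatestPartitions and the helper latestVersionPaths are hypothetical names introduced for this illustration, and the input list of directories is assumed to have been obtained separately (for example by listing the file system).

```scala
object LatestPartitions {
  // Matches Hive-style partition directories such as
  // /data/year=2016/month=10/version=3 (an optional trailing slash is allowed).
  private val PartitionPath = """.*/year=(\d+)/month=(\d+)/version=(\d+)/?""".r

  /** For every (year, month), keeps only the path with the highest version.
    * Paths that do not match the expected layout are ignored.
    * The result is ordered by (year, month).
    */
  def latestVersionPaths(paths: Seq[String]): Seq[String] = {
    // Parse each path into ((year, month), version, originalPath).
    val parsed = paths.collect {
      case p @ PartitionPath(y, m, v) => ((y.toInt, m.toInt), v.toInt, p)
    }
    parsed
      .groupBy(_._1)                 // group candidates by (year, month)
      .toSeq
      .sortBy(_._1)                  // deterministic order by (year, month)
      .map { case (_, candidates) => candidates.maxBy(_._2)._3 }
  }
}
```

The selected paths can then be passed to the reader with something like spark.read.option("basePath", "/data").load(latest: _*), where latest is the sequence returned by the helper. With basePath set to /data, year, month and version should still show up as partition columns.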

