apache spark - How to load only the data of the last partition
I have some data partitioned this way:
/data/year=2016/month=9/version=0
/data/year=2016/month=10/version=0
/data/year=2016/month=10/version=1
/data/year=2016/month=10/version=2
/data/year=2016/month=10/version=3
/data/year=2016/month=11/version=0
/data/year=2016/month=11/version=1
When using this data, I'd like to load only the last version of each month.
A simple way to do this is to call load("/data/year=2016/month=11/version=3") instead of load("/data").
The drawback of this solution is the loss of partitioning information such as year and month, which means it is no longer possible to apply operations based on the year or the month.
Is it possible to ask Spark to load only the last version of each month? How would you go about this?
Well, Spark supports predicate push-down, so if you provide a filter following your load, it will only read in the data fulfilling the criteria in the filter. Like this:
spark.read.option("basePath", "/data").load("/data").filter('version === 3)
And you get to keep your partitioning information :)
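Note that the filter above keeps one fixed version number, while in the listing from the question the latest version differs per month (0, 3 and 1). One possible way to handle that, sketched here under assumptions, is to pick the highest-version directory for each (year, month) first and then hand only those paths to the reader. The object LatestPartitions and its method latestVersionPaths are hypothetical helpers, not part of Spark, and the sketch assumes the directory paths were already obtained elsewhere (for example by listing the file system).

```scala
// Hypothetical helper, not part of Spark: picks the highest-version
// directory for each (year, month) from a list of partition paths.
object LatestPartitions {
  // Matches paths shaped like /data/year=2016/month=10/version=3
  private val PartitionPath =
    """.*/year=(\d+)/month=(\d+)/version=(\d+)/?""".r

  def latestVersionPaths(paths: Seq[String]): Seq[String] = {
    // Keep only paths that follow the expected layout, parsed into numbers.
    val parsed = paths.collect {
      case p @ PartitionPath(y, m, v) => (y.toInt, m.toInt, v.toInt, p)
    }
    parsed
      .groupBy { case (y, m, _, _) => (y, m) }  // one group per month
      .values
      .map(_.maxBy(_._3))                       // highest version in the group
      .toSeq
      .sortBy { case (y, m, _, _) => (y, m) }   // stable, chronological order
      .map(_._4)
  }
}
```

With the selected paths in a value such as latest, something like spark.read.option("basePath", "/data").load(latest: _*) should read only those directories while still exposing year, month and version as columns, since basePath tells Spark where partition discovery starts.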