Reading an XML file in PySpark via Hadoop Streaming
I'm trying to adapt the code here from its Scala version to a PySpark version. Here's the code I'm using:
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    conf = SparkConf().setAppName("Parse XML file")
    sc = SparkContext(conf=conf)
    sqlContext = HiveContext(sc)

    sc._jsc.hadoopConfiguration().set('stream.recordreader.class',
                                      'org.apache.hadoop.streaming.StreamXmlRecordReader')
    sc._jsc.hadoopConfiguration().set('stream.recordreader.begin', '<page>')
    sc._jsc.hadoopConfiguration().set('stream.recordreader.end', '</page>')

    xml_sdf = sc.newAPIHadoopFile(xml_data_path,
                                  'org.apache.hadoop.streaming.StreamInputFormat',
                                  'org.apache.hadoop.io.Text',
                                  'org.apache.hadoop.io.Text')
    print("Found {0} records.".format(xml_sdf.count()))
    sc.stop()
The error I'm getting is:
    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
    : java.lang.ClassCastException: org.apache.hadoop.streaming.StreamInputFormat cannot be cast to org.apache.hadoop.mapreduce.InputFormat
Is there a different input format or different settings I can use to make this work?
The ClassCastException happens because StreamInputFormat implements only the old org.apache.hadoop.mapred.InputFormat interface, while newAPIHadoopFile expects an org.apache.hadoop.mapreduce.InputFormat. The easiest solution is to use the spark-xml package instead. In your case (all documents start with <page>), the code below will load the data into a DataFrame:
    sqlContext.read.format('com.databricks.spark.xml') \
        .options(rowTag='page') \
        .load('samplexml.xml')
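For completeness, here is a minimal end-to-end sketch of that approach. The package coordinate and version in the launch command are assumptions; pick the spark-xml artifact that matches your Spark and Scala versions:

    # Hypothetical launch command; adjust the spark-xml version to your setup:
    #   spark-submit --packages com.databricks:spark-xml_2.11:0.4.1 parse_xml.py
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    conf = SparkConf().setAppName("Parse XML file")
    sc = SparkContext(conf=conf)
    sqlContext = HiveContext(sc)

    # Each <page>...</page> element becomes one row; child elements and
    # attributes are inferred into the schema.
    df = sqlContext.read.format('com.databricks.spark.xml') \
        .options(rowTag='page') \
        .load('samplexml.xml')

    df.printSchema()
    print("Found {0} records.".format(df.count()))
    sc.stop()

If you would rather keep the Hadoop streaming reader, you have to go through hadoopFile rather than newAPIHadoopFile, since the old mapred API is the one StreamInputFormat actually implements. A hedged sketch, assuming the hadoop-streaming jar is on the classpath:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("Parse XML file")
    sc = SparkContext(conf=conf)

    # hadoopFile uses the old org.apache.hadoop.mapred API, so StreamInputFormat
    # can be cast successfully; the record reader settings go in the conf dict.
    xml_rdd = sc.hadoopFile(
        'samplexml.xml',
        'org.apache.hadoop.streaming.StreamInputFormat',
        'org.apache.hadoop.io.Text',
        'org.apache.hadoop.io.Text',
        conf={'stream.recordreader.class':
                  'org.apache.hadoop.streaming.StreamXmlRecordReader',
              'stream.recordreader.begin': '<page>',
              'stream.recordreader.end': '</page>'})
    print("Found {0} records.".format(xml_rdd.count()))
    sc.stop()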