Reading an XML file in PySpark via Hadoop Streaming


I'm trying to adapt the code here from its Scala version to a PySpark version. Here's the code I'm using:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    conf = SparkConf().setAppName("Parse XML File")
    sc = SparkContext(conf=conf)
    sqlContext = HiveContext(sc)

    # Tell Hadoop Streaming's record reader to treat each <page>...</page>
    # block as a single record
    sc._jsc.hadoopConfiguration().set('stream.recordreader.class',
                                      'org.apache.hadoop.streaming.StreamXmlRecordReader')
    sc._jsc.hadoopConfiguration().set('stream.recordreader.begin', '<page>')
    sc._jsc.hadoopConfiguration().set('stream.recordreader.end', '</page>')

    xml_sdf = sc.newAPIHadoopFile(xml_data_path,
                                  'org.apache.hadoop.streaming.StreamInputFormat',
                                  'org.apache.hadoop.io.Text',
                                  'org.apache.hadoop.io.Text')
    print("Found {0} records.".format(xml_sdf.count()))

    sc.stop()

The error I'm getting is:

    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
    : java.lang.ClassCastException: org.apache.hadoop.streaming.StreamInputFormat cannot be cast to org.apache.hadoop.mapreduce.InputFormat

Is there a different input format or setting I can use to make this work?
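
The cast fails because Hadoop Streaming's StreamInputFormat only implements the old org.apache.hadoop.mapred.InputFormat interface, while newAPIHadoopFile expects a new-API (org.apache.hadoop.mapreduce) input format. One possible workaround, sketched below and untested, is to keep the same record reader settings but go through the old-API entry point sc.hadoopFile instead:

    # Untested sketch: StreamInputFormat is an old-API (mapred) input format,
    # so read it with sc.hadoopFile, the old-API counterpart of newAPIHadoopFile.
    xml_rdd = sc.hadoopFile(xml_data_path,
                            'org.apache.hadoop.streaming.StreamInputFormat',
                            'org.apache.hadoop.io.Text',
                            'org.apache.hadoop.io.Text')
    print("Found {0} records.".format(xml_rdd.count()))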

That said, the easiest solution is to use the spark-xml package. In your case (all documents start with <page>), the code below will load the data into a DataFrame:

    df = sqlContext.read.format('com.databricks.spark.xml') \
        .options(rowTag='page').load('samplexml.xml')
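
Note that spark-xml is an external package and has to be on the classpath, e.g. by launching with spark-submit --packages com.databricks:spark-xml_2.11:0.4.1 (adjust the Scala suffix and version to your build). Each <page> element becomes one row of the DataFrame, so the count from your original snippet reduces to:

    df.printSchema()   # schema is inferred from the <page> elements
    print("Found {0} records.".format(df.count()))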
