Apache Pig - How to read an empty field with a regex in Pig?
I'm trying to parse a custom-formatted log file that looks like this:
2016-11-05 20:00:00,007 [some$tr!ng_nowhitespace.here] info sin.my.package.objectname timetotal=73 timefirst=73 dev="iphone"
2016-11-05 20:00:02,010 [some$tr!ng_nowhitespace.here/too] info sin.my.package.objectname timetotal=350 timefirst=105 timesecond=245 dev="android"
2016-11-05 20:00:10,207 [some$tr!ng_nowhitespace.here/anothertime] info sin.my.package.objectname timetotal=420 timefirst=100 timesecond=205 timethird=115 dev="ipad"
Notice that the field timefirst= is present in all log lines, while timesecond= and timethird= may or may not be present.
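To make the optional-field situation concrete, here is a minimal sketch of one common way to express "this field may be absent" in a regex: wrap the optional part in a non-capturing group, `(?: ... )?`, so that only the value itself is captured. It uses Python's `re` module as a stand-in for the Java regex engine that Pig runs on; for the constructs used here the two should behave alike, but that is an assumption and not something tested inside Pig. The names `TAIL` and `parse_tail` are made up for the illustration.

```python
import re

# Non-capturing groups (?: ... ) make a field optional without adding an
# extra capture group, so the number of groups equals the number of fields.
TAIL = re.compile(
    r'timetotal=([0-9]+) timefirst=([0-9]+) '
    r'(?:timesecond=([0-9]+) )?'
    r'(?:timethird=([0-9]+) )?'
    r'dev="(\w+)"'
)

def parse_tail(line):
    """Return (timetotal, timefirst, timesecond, timethird, dev), or None if no match."""
    m = TAIL.search(line)
    return m.groups() if m else None

# The three synthetic sample lines from the question.
samples = [
    '2016-11-05 20:00:00,007 [some$tr!ng_nowhitespace.here] info sin.my.package.objectname timetotal=73 timefirst=73 dev="iphone"',
    '2016-11-05 20:00:02,010 [some$tr!ng_nowhitespace.here/too] info sin.my.package.objectname timetotal=350 timefirst=105 timesecond=245 dev="android"',
    '2016-11-05 20:00:10,207 [some$tr!ng_nowhitespace.here/anothertime] info sin.my.package.objectname timetotal=420 timefirst=100 timesecond=205 timethird=115 dev="ipad"',
]

for s in samples:
    print(parse_tail(s))
# ('73', '73', None, None, 'iphone')
# ('350', '105', '245', None, 'android')
# ('420', '100', '205', '115', 'ipad')
```

An absent optional field comes back as `None` here; how a Pig loader represents such an unmatched group is not covered by this sketch.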
I am using the following Pig script to parse the log lines with MyRegExLoader():
data = load '/path/to/raw/file.lzo'
    using org.apache.pig.piggybank.storage.MyRegExLoader('([0-9]{4})-([0-9]{2})-([0-9]{2}) ([0-9]{2}):([0-9]{2}):([0-9]{2}),([0-9]+) \\[(\\S+)\\] ([a-z]+) (\\S+) timetotal=([0-9]+) timefirst=([0-9]+) (timesecond=([0-9]+) )?(timethird=([0-9]+) )?dev="(\\w+)"')
    as (year: int, month: int, date: int, hour: int, mins: int, sec: int, bytesize: int, blockstr: chararray, msgflag: chararray, objectstr: chararray, timetotal: int, timefirst: int, timesecond: int, dev: chararray);
store data into '/user/myuser/pigdumps/pigdump1/' using PigStorage(',');
I know I am going wrong in how I am parsing timesecond= and timethird=, but that's the best I could do with my limited regex and Pig knowledge. Here are the error logs from the console:
Input(s):
Failed to read data from "/path/to/raw/file.lzo"

Output(s):
Failed to produce result in "/user/myuser/pigdumps/pigdump1"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1478169073918_75270

2016-11-17 16:24:35,503 [uber-SubtaskRunner] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-11-17 16:24:35,516 [uber-SubtaskRunner] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2999: Unexpected internal error. null
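One thing that can be checked independently of Pig is how many capture groups the regex in the script defines compared with the number of fields declared in the schema. The sketch below counts them with Python's `re` module, again as a stand-in for Java's regex engine, with the pattern written using single backslashes and `\S+` assumed for the two non-whitespace fields. `ORIGINAL`, `NON_CAPTURING` and `SCHEMA_FIELDS` are names invented for this illustration. Whether this mismatch is what triggers ERROR 2999 is not established here; it is only a discrepancy worth ruling out.

```python
import re

# The pattern from the script, with Pig's doubled backslashes reduced to
# single ones for a Python raw string.
ORIGINAL = (
    r'([0-9]{4})-([0-9]{2})-([0-9]{2}) ([0-9]{2}):([0-9]{2}):([0-9]{2}),([0-9]+) '
    r'\[(\S+)\] ([a-z]+) (\S+) timetotal=([0-9]+) timefirst=([0-9]+) '
    r'(timesecond=([0-9]+) )?(timethird=([0-9]+) )?dev="(\w+)"'
)

# Same pattern, but the two optional wrappers are made non-capturing.
NON_CAPTURING = (
    ORIGINAL
    .replace('(timesecond=', '(?:timesecond=')
    .replace('(timethird=', '(?:timethird=')
)

# Field names exactly as declared in the script's schema.
SCHEMA_FIELDS = [
    'year', 'month', 'date', 'hour', 'mins', 'sec', 'bytesize',
    'blockstr', 'msgflag', 'objectstr', 'timetotal', 'timefirst',
    'timesecond', 'dev',
]

print(re.compile(ORIGINAL).groups)       # 17
print(re.compile(NON_CAPTURING).groups)  # 15
print(len(SCHEMA_FIELDS))                # 14
```

The original pattern yields 17 groups because each optional wrapper is itself a capture group. Even with non-capturing wrappers there are 15 groups against 14 declared fields, since the schema has no entry for timethird.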
Any workaround, or help in looking in the right direction, is appreciated.
Thanks!
P.S.: I am working with private data and cannot provide the original samples. This is synthetic data made to replicate the problem in the best way possible. Please pardon any mistakes in synthesizing it and let me know; I shall correct them.