pdf - iText LocationTextExtractionStrategy/HorizontalTextExtractionStrategy splits text into single characters
I used an extended version of LocationTextExtractionStrategy to extract the connected texts of a PDF together with their positions/sizes. I did this by using the locationalResult list. This worked well until I tested a PDF containing texts in a different font (TTF). These texts are suddenly split into single characters or small fragments.
For example, "detail" is no longer one object within the locationalResult list but is split into six items (d, e, t, a, i, l).
I tried using the HorizontalTextExtractionStrategy by making the GetLocationalResult method public:
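    // locationalResultField is presumably a System.Reflection.FieldInfo for the
    // private locationalResult field; its declaration is not shown in the question.
    public List<TextChunk> GetLocationalResult()
    {
        return (List<TextChunk>)locationalResultField.GetValue(this);
    }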
and using the PdfReaderContentParser to extract the texts:
    PdfReader reader = new PdfReader("some_pdf");
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    // i is the page number being processed
    var strategy = parser.ProcessContent(i, new HorizontalTextExtractionStrategy());
    foreach (HorizontalTextExtractionStrategy.HorizontalTextChunk chunk in strategy.GetLocationalResult())
    {
        // use chunk
    }
But this returns the same result. Is there any other way to extract connected texts from a PDF?
"I used an extended version of LocationTextExtractionStrategy to extract the connected texts of a PDF together with their positions/sizes. I did this by using the locationalResult list. This worked well until I tested a PDF containing texts in a different font (TTF). These texts are suddenly split into single characters or small fragments."
This problem is due to wrong expectations concerning the contents of the private List member variable LocationTextExtractionStrategy.locationalResult.
This list of TextChunk instances contains the pieces of text exactly as they were forwarded to the strategy by the parsing framework (or as they were preprocessed by some filter class), and the framework forwards each single string it encounters in the content stream separately.
Thus, if a seemingly connected word in the content stream is drawn using multiple strings, you get multiple TextChunk instances for it.
There is some "intelligence" in the method GetResultantText for joining these chunks properly, adding a space where necessary, and so on.
In the case of your document, "detail " is drawn like this:
    [<0027> -0.2<00280037> 0.2<0024002c> 0.2<002f> -0.2<0003>] TJ
As you can see, there are slight moves of the text insertion point between 'd' and 'e', 't' and 'a', 'i' and 'l', and 'l' and ' '. (Such mini moves usually represent kerning.) Thus, you'll get individual TextChunk instances for 'd', 'et', 'ai', 'l', and ' '.
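To make the fragmentation concrete, here is a minimal sketch in plain Java with no iText dependency (iText itself is a Java library, and the logic ports directly to C#). It only splits the array shown above into its string operands, which is roughly the granularity at which text reaches the strategy. The class and method names are hypothetical, and the assumption that every glyph code is two bytes wide holds for the codes shown here but not for fonts in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText: illustrates why one TJ array
// can yield several text chunks.
class TjArraySplit {

    // Returns the hex string operands of a TJ array, one entry per string.
    // The numbers between the strings (the kerning moves) are skipped.
    static List<String> stringOperands(String tjArray) {
        List<String> operands = new ArrayList<>();
        Matcher m = Pattern.compile("<([0-9A-Fa-f]+)>").matcher(tjArray);
        while (m.find()) {
            operands.add(m.group(1));
        }
        return operands;
    }

    // Number of glyph codes in one operand, ASSUMING two-byte codes
    // (four hex digits per glyph), as in the example array.
    static int glyphCount(String hexOperand) {
        return hexOperand.length() / 4;
    }

    public static void main(String[] args) {
        String tj = "[<0027> -0.2<00280037> 0.2<0024002c> 0.2<002f> -0.2<0003>] TJ";
        List<String> operands = stringOperands(tj);
        for (int i = 0; i < operands.size(); i++) {
            String op = operands.get(i);
            System.out.println("chunk " + (i + 1) + ": codes " + op
                    + " (" + glyphCount(op) + " glyph(s))");
        }
    }
}
```

The array contains five string operands, so a word that looks connected on the page arrives as five separate pieces. Mapping the glyph codes back to characters depends on the font's encoding and is deliberately left out of this sketch.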
Admittedly, the LocationTextExtractionStrategy.locationalResult member is not very well documented; but as it is a private member, this IMHO is forgivable.
That this worked well for many documents is due to many PDF creators not applying kerning and simply drawing connected text using single string objects.
The HorizontalTextExtractionStrategy is derived from LocationTextExtractionStrategy and mainly differs in the way it arranges the TextChunk instances into a single string. Thus, you'll see the same fragmentation here.
"Is there any other way to extract connected texts from a PDF?"
If you want "connected texts" as in "atomic string objects in the content stream", you already have them.
If you want "connected texts" as in "visually connected texts, no matter where the constituent letters are drawn in the content stream", you have to glue the TextChunk instances together the way LocationTextExtractionStrategy and HorizontalTextExtractionStrategy do in GetResultantText, in combination with the comparison methods in their respective TextChunkLocationDefaultImp and HorizontalTextChunkLocation implementations.
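As a rough illustration of such gluing, the following plain Java sketch (no iText dependency; the same logic ports directly to C#) merges chunks that sit on the same baseline and are separated by less than a small gap. Everything in it is an assumption made for illustration: the Chunk class is a hypothetical stand-in for TextChunk that only carries text, a horizontal extent, and a baseline, and the 0.5 unit line tolerance and the maxGap threshold are arbitrary. The real strategies compare chunk locations in a more careful way, for example by taking text orientation into account.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not iText code: glue fragments into visually connected text.
class ChunkGluing {

    // Simplified stand-in for a text chunk: horizontal text only.
    static final class Chunk {
        final String text;
        final float startX, endX, baselineY;

        Chunk(String text, float startX, float endX, float baselineY) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.baselineY = baselineY;
        }
    }

    // Two chunks count as being on the same line if their baselines nearly coincide.
    static boolean sameLine(Chunk a, Chunk b) {
        return Math.abs(a.baselineY - b.baselineY) < 0.5f;
    }

    // Merges neighbouring chunks on one line whose horizontal gap is at most maxGap.
    static List<Chunk> glue(List<Chunk> chunks, float maxGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Top of the page first (PDF y grows upwards), then left to right.
        sorted.sort(Comparator.<Chunk>comparingDouble(c -> -c.baselineY)
                .thenComparingDouble(c -> c.startX));

        List<Chunk> result = new ArrayList<>();
        Chunk current = null;
        for (Chunk next : sorted) {
            if (current != null && sameLine(current, next)
                    && next.startX - current.endX <= maxGap) {
                // Close enough: extend the current run with the next fragment.
                current = new Chunk(current.text + next.text,
                        current.startX, next.endX, current.baselineY);
            } else {
                if (current != null) {
                    result.add(current);
                }
                current = next;
            }
        }
        if (current != null) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("d", 10f, 17f, 700f));
        chunks.add(new Chunk("et", 17.2f, 29f, 700f));
        chunks.add(new Chunk("ai", 28.8f, 38f, 700f));
        chunks.add(new Chunk("l", 38.2f, 42f, 700f));
        for (Chunk c : glue(chunks, 1.0f)) {
            System.out.println("[" + c.text + "] from x=" + c.startX + " to x=" + c.endX);
        }
    }
}
```

The merged chunk keeps the start of the first fragment and the end of the last one, so positions and sizes of the connected text remain available. Choosing maxGap is the delicate part: too small and kerned words stay split, too large and separate words run together.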