pdf - iText LocationTextExtractionStrategy/HorizontalTextExtractionStrategy splits text into single characters


I used an extended version of LocationTextExtractionStrategy to extract the connected texts of a PDF and their positions/sizes. I did this by using the locationalResult list. This worked well until I tested a PDF containing texts in a different font (TTF). These texts are split into single characters or small fragments.

For example, "detail" is no longer one object within the locationalResult list but is split into six items (d, e, t, a, i, l).

I tried using the HorizontalTextExtractionStrategy by making the GetLocationalResult method public:

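    // locationalResultField is a FieldInfo for the private locationalResult field, obtained via reflection
    public List<TextChunk> GetLocationalResult()
    {
        return (List<TextChunk>)locationalResultField.GetValue(this);
    }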

and using the PdfReaderContentParser to extract the texts:

    // i is the number of the page being processed
    var reader = new PdfReader("some_pdf");
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    var strategy = parser.ProcessContent(i, new HorizontalTextExtractionStrategy());

    foreach (HorizontalTextExtractionStrategy.HorizontalTextChunk chunk in strategy.GetLocationalResult())
    {
        // process the chunk
    }

But this returns the same result. Is there any other way to extract connected texts from a PDF?

"I used an extended version of LocationTextExtractionStrategy to extract the connected texts of a PDF and their positions/sizes. I did this by using the locationalResult list. This worked well until I tested a PDF containing texts in a different font (TTF). These texts are split into single characters or small fragments."

This problem is due to wrong expectations concerning the contents of the private List member variable LocationTextExtractionStrategy.locationalResult.

This list of TextChunk instances contains the pieces of text as they were forwarded to the strategy by the parsing framework (or as they were preprocessed by some filter class), and the framework forwards each single string it encounters in the content stream separately.

Thus, if a seemingly connected word in the content stream is actually drawn using multiple strings, you get multiple TextChunk instances for it.

There actually is some "intelligence" in the method GetResultantText for joining these chunks properly, adding a space where necessary, and so on.

In the case of your document, "detail " is drawn like this:

    [<0027> -0.2<00280037> 0.2<0024002c> 0.2<002f> -0.2<0003>] TJ

As you can see, there are slight moves of the text insertion point between 'd' and 'e', 't' and 'a', 'i' and 'l', and 'l' and ' '. (Such mini moves usually represent kerning.) Thus, you'll get individual TextChunk instances for 'd', 'et', 'ai', 'l', and ' '.

Admittedly, the LocationTextExtractionStrategy.locationalResult member is not very well documented; but as it is a private member, this IMHO is forgivable.
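To make the fragmentation concrete, here is a minimal sketch that splits the operand array shown above into its elements and counts the glyph codes in each string. It does not use iText at all; `TjArrayDemo`, `Tokenize`, and `GlyphsPerString` are hypothetical names, and the code assumes that every glyph code is two bytes long (four hex digits), as the strings above suggest.

```csharp
using System;
using System.Collections.Generic;

public static class TjArrayDemo
{
    // Splits the operand array of a TJ operator into its elements:
    // hex strings such as "<0027>" and numeric adjustments such as "-0.2".
    public static List<string> Tokenize(string array)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < array.Length)
        {
            char c = array[pos];
            if (c == '<')
            {
                int end = array.IndexOf('>', pos);
                if (end < 0) break; // unterminated string: stop parsing
                tokens.Add(array.Substring(pos, end - pos + 1));
                pos = end + 1;
            }
            else if (c == '-' || c == '.' || char.IsDigit(c))
            {
                int start = pos++;
                while (pos < array.Length && (array[pos] == '.' || char.IsDigit(array[pos])))
                    pos++;
                tokens.Add(array.Substring(start, pos - start));
            }
            else
            {
                pos++; // skip brackets and whitespace
            }
        }
        return tokens;
    }

    // Returns the number of glyph codes in each string element.
    // Assumption: two-byte codes, i.e. four hex digits per glyph.
    // Each string element is one separately forwarded piece of text.
    public static List<int> GlyphsPerString(string array)
    {
        var counts = new List<int>();
        foreach (string token in Tokenize(array))
        {
            if (token.StartsWith("<"))
                counts.Add((token.Length - 2) / 4);
        }
        return counts;
    }
}
```

For the array above, `TjArrayDemo.GlyphsPerString` returns the counts 1, 2, 2, 1, 1: five strings separated by four numeric adjustments, so five pieces of text reach the strategy separately.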


That this worked well for many documents is due to many PDF creators not applying kerning and simply drawing connected text using single string objects.


The HorizontalTextExtractionStrategy is derived from LocationTextExtractionStrategy and mainly differs in the way it arranges the TextChunk instances into a single string. Thus, you'll see the same fragmentation here.


"Is there any other way to extract connected texts from a PDF?"

If you want "connected texts" in the sense of "atomic string objects in the content stream", you already have them.

If you want "connected texts" in the sense of "visually connected texts, no matter how the constituent letters are drawn in the content stream", you have to glue these TextChunk instances together yourself, the way LocationTextExtractionStrategy and HorizontalTextExtractionStrategy do in GetResultantText in combination with the comparison methods in their respective TextChunkLocationDefaultImp and HorizontalTextChunkLocation implementations.
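The gluing idea can be sketched roughly as follows. This is not iText code and not what GetResultantText literally does; `Piece` and `ChunkGluer` are hypothetical stand-ins, each piece is reduced to its text, a baseline y coordinate and a horizontal extent, two pieces are assumed to be on the same line if their rounded y coordinates match, and `maxGap` is an assumed threshold you would have to tune (for example relative to the width of a space character).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a text chunk: text, baseline y, horizontal extent.
public sealed class Piece
{
    public readonly string Text;
    public readonly float Y, StartX, EndX;

    public Piece(string text, float y, float startX, float endX)
    {
        Text = text; Y = y; StartX = startX; EndX = endX;
    }
}

public static class ChunkGluer
{
    // Assumption: pieces whose rounded baseline matches are on the same line.
    private static int LineKey(Piece p)
    {
        return (int)Math.Round(p.Y);
    }

    // Sorts pieces top to bottom and left to right, then merges neighbours
    // on the same line whose horizontal gap is at most maxGap.
    public static List<Piece> Glue(IEnumerable<Piece> pieces, float maxGap)
    {
        var sorted = new List<Piece>(pieces);
        sorted.Sort((a, b) =>
        {
            int byLine = LineKey(b).CompareTo(LineKey(a)); // larger y is higher on the page
            return byLine != 0 ? byLine : a.StartX.CompareTo(b.StartX);
        });

        var result = new List<Piece>();
        foreach (Piece p in sorted)
        {
            Piece last = result.Count > 0 ? result[result.Count - 1] : null;
            bool glue = last != null
                && LineKey(last) == LineKey(p)
                && p.StartX - last.EndX <= maxGap; // small or negative gap (kerning)

            if (glue)
            {
                // Replace the last piece by the merged one.
                result[result.Count - 1] = new Piece(
                    last.Text + p.Text, last.Y, last.StartX, Math.Max(last.EndX, p.EndX));
            }
            else
            {
                result.Add(p);
            }
        }
        return result;
    }
}
```

Fed with the fragments 'd', 'et', 'ai', 'l', and ' ' at touching x positions on one line, `ChunkGluer.Glue` returns a single piece with the text "detail ", while a piece further away on the same line or on another line stays separate. Unlike GetResultantText, this sketch never inserts spaces; it only decides which pieces belong together.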

