java - Charset of the attribute values in Jsoup


I use jsoup and need to pick up the attribute values of tags inside an HTML document in ASCII encoding, keeping them as they are, without converting them.

So, I have the following HTML document

<!doctype html>
<head>
    <meta charset="ascii">
</head>
<body>
    <div title="2 &gt; 1, 1 > 0, &agrave; vs &egrave;">
        3 &gt; 2,  1 > 0
    </div>
</body>

which I want to parse by means of jsoup.

I need to extract the value of the title attribute as it is: 2 &gt; 1, 1 > 0, &agrave; vs &egrave;.

I've created the Document object doc as below (it is in Kotlin, but I don't think that is important here):

import java.io.File
import java.nio.charset.Charset
import org.jsoup.Jsoup

val charset = Charset.forName("ascii")
val doc = Jsoup.parse(File("test.html").readText(charset))
doc.outputSettings().charset(charset)

When I print out doc by means of

println(doc.toString())

I get the following string

<!doctype html>
<html>
 <head>
  <meta charset="ascii">
 </head>
 <body>
  <div title="2 > 1, 1 > 0, &agrave; vs &egrave;">
   3 &gt; 2
  </div>
 </body>
</html>

which differs from the file content in the title attribute value (&gt; gets transformed into > in the string "2 > 1"), while the rest of the document is OK.

Then, inspecting the attribute value

doc.body().select("div").forEach { div ->
    println("title = ${div.attr("title")}")
}

produces the following string

title = 2 > 1, 1 > 0, à vs è 

Notice that &agrave; and &egrave; are transformed into à and è.

My question is: in jsoup, how can I get the attribute values of HTML tags while preserving the way they are written in the input file?

In the example above I need the string "2 &gt; 1, 1 > 0, &agrave; vs &egrave;" (as written in the input file), and neither "2 > 1, 1 > 0, &agrave; vs &egrave;" nor "2 &gt; 1, 1 &gt; 0, à vs è".

The attr() method returns the string with the HTML entities already decoded, and I did not find a way to keep the entities as written. However, you can use the Jsoup.clean() method to convert the characters in the string back to entities.

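import org.jsoup.nodes.Document
import org.jsoup.safety.Whitelist  // named Safelist in newer jsoup releases

val charset = Charset.forName("ascii")
val doc = Jsoup.parse(File("test.html").readText(charset))
doc.body().select("div").forEach { div ->
    // Re-escape the decoded attribute value using the ASCII output settings
    val title = Jsoup.clean(div.attr("title"), "", Whitelist.none(), Document.OutputSettings().charset(charset))
    println("title = $title")
}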

The result is:

title = 2 &gt; 1, &agrave; vs &egrave; 

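Of course, this might not be a solution for your use case, because it re-encodes every escapable character rather than restoring the original spelling.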
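
If the exact source spelling is what matters, one option outside jsoup is to read the attribute straight from the raw file text instead of from the parsed document. This is only a minimal sketch, assuming the attribute value is double-quoted and that a regular expression is good enough for the input at hand. rawAttributeValues is a hypothetical helper, not a jsoup API:

```kotlin
// Hypothetical helper (not part of jsoup): returns the raw text of every
// double-quoted occurrence of the given attribute, exactly as written in the
// source, so no entity is decoded or re-encoded.
// Assumption: values are double-quoted; single-quoted or unquoted values are skipped.
fun rawAttributeValues(html: String, attribute: String): List<String> {
    // The lookbehind keeps "title" from matching inside names such as "data-title".
    val pattern = Regex("(?<![\\w-])" + Regex.escape(attribute) + "\\s*=\\s*\"([^\"]*)\"")
    return pattern.findAll(html).map { it.groupValues[1] }.toList()
}

fun main() {
    val html = """<div title="2 &gt; 1, 1 > 0, &agrave; vs &egrave;">3 &gt; 2</div>"""
    rawAttributeValues(html, "title").forEach { println("title = $it") }
    // prints: title = 2 &gt; 1, 1 > 0, &agrave; vs &egrave;
}
```

This skips the HTML parser entirely, so it will also match text that merely looks like an attribute (inside comments or scripts, for instance). It fits small, well-formed inputs like the one in the question rather than arbitrary HTML.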

