Java - Charset of the attribute values in Jsoup
I use Jsoup and need to pick up the attribute values of tags inside an HTML document in ASCII encoding, keeping them as they are, without converting them.
So, I have the following HTML document

    <!DOCTYPE html>
    <head>
        <meta charset="ascii">
    </head>
    <body>
        <div title="2 &gt; 1, 1 &gt; 0, &agrave; vs &egrave;">
            3 &gt; 2, 1 &gt; 0
        </div>
    </body>

which I want to parse by means of Jsoup.
I need to extract the value of the title attribute, which is: 2 &gt; 1, 1 &gt; 0, &agrave; vs &egrave;.
I've created a Document object doc as below (it is in Kotlin, but I don't think that is important here):

    import java.io.File
    import java.nio.charset.Charset
    import org.jsoup.Jsoup

    val charset = Charset.forName("ascii")
    val doc = Jsoup.parse(File("test.html").readText(charset))
    doc.outputSettings().charset(charset)

When I print out doc by means of

    println(doc.toString())

I get the following string

    <!doctype html>
    <html>
     <head>
      <meta charset="ascii">
     </head>
     <body>
      <div title="2 > 1, 1 > 0, &agrave; vs &egrave;">
       3 &gt; 2
      </div>
     </body>
    </html>

which differs from the file content in the title attribute value (&gt; gets transformed into > in the string "2 > 1"), while the rest of the document is OK.
Then, inspecting the attribute value with

    doc.body().select("div").forEach { div ->
        println("title = ${div.attr("title")}")
    }

produces the following string

    title = 2 > 1, 1 > 0, à vs è

Notice that &agrave; and &egrave; are transformed into à and è.
My question is: in Jsoup, how can I get the attribute values of HTML tags while preserving the way they are written in the input file?
In the example above I need the string "2 &gt; 1, 1 &gt; 0, &agrave; vs &egrave;" (as written in the input file), and neither "2 > 1, 1 > 0, &agrave; vs &egrave;" nor "2 > 1, 1 > 0, à vs è".
The attr() method returns the string with the HTML entities already decoded, and I could not find a way to keep the HTML entities. However, you can use the Jsoup.clean() method to convert the characters in the string back into entities.
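
    import java.io.File
    import java.nio.charset.Charset
    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import org.jsoup.safety.Whitelist  // newer jsoup releases name this class Safelist

    val charset = Charset.forName("ascii")
    val doc = Jsoup.parse(File("test.html").readText(charset))
    doc.body().select("div").forEach { div ->
        // Re-escape the decoded attribute value for an ASCII output charset.
        val title = Jsoup.clean(div.attr("title"), "", Whitelist.none(), Document.OutputSettings().charset(charset))
        println("title = $title")
    }

The result is:

    title = 2 &gt; 1, &agrave; vs &egrave;

Of course, this might not be the solution for your use case.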
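If pulling in the cleaner feels heavy, the same re-escaping idea can be sketched with the Kotlin standard library alone. The helper below, escapeAttributeAscii, is a hypothetical name and not part of Jsoup; it assumes that numeric character references are acceptable in place of named entities, so it still does not reproduce the exact spelling used in the input file (it would emit &#xE0; where the file has &agrave;).

```kotlin
// Hypothetical helper, not a Jsoup API: re-escapes an already decoded
// attribute value so it contains only ASCII characters.
fun escapeAttributeAscii(value: String): String {
    val out = StringBuilder()
    var i = 0
    while (i < value.length) {
        val cp = value.codePointAt(i)
        when {
            cp == '&'.code -> out.append("&amp;")
            cp == '<'.code -> out.append("&lt;")
            cp == '>'.code -> out.append("&gt;")
            cp == '"'.code -> out.append("&quot;")
            // Anything outside ASCII becomes a hexadecimal character reference.
            cp > 0x7F -> out.append("&#x").append(cp.toString(16).uppercase()).append(';')
            else -> out.appendCodePoint(cp)
        }
        // Advance by one or two chars, depending on whether cp is a surrogate pair.
        i += Character.charCount(cp)
    }
    return out.toString()
}

fun main() {
    // The decoded value that attr("title") returns in the question.
    val decoded = "2 > 1, 1 > 0, \u00E0 vs \u00E8"
    println(escapeAttributeAscii(decoded))
    // prints: 2 &gt; 1, 1 &gt; 0, &#xE0; vs &#xE8;
}
```

A browser decodes &#xE0; and &agrave; to the same character, so this only helps when the goal is ASCII-safe output rather than a byte-for-byte copy of the source attribute.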