java - Charset of the attribute values in Jsoup
I use Jsoup and need to pick up the attribute values of tags inside an HTML document in ASCII encoding, keeping them as they are, without converting them.
So, I have the following HTML document:
    <!doctype html>
    <head>
    <meta charset="ascii">
    </head>
    <body>
    <div title="2 &gt; 1, 1 > 0, &agrave; vs &egrave;">
    3 > 2, 1 > 0
    </div>
    </body>
which I want to parse by means of Jsoup.
I need to extract the value of the title attribute exactly as it is written: 2 &gt; 1, 1 > 0, &agrave; vs &egrave;.
I've created the Document object doc as below (it is in Kotlin, but I don't think that is important here):
    import java.io.File
    import java.nio.charset.Charset
    import org.jsoup.Jsoup

    val charset = Charset.forName("ascii")
    val doc = Jsoup.parse(File("test.html").readText(charset))
    doc.outputSettings().charset(charset)
When I print out doc by means of
    println(doc.toString())
I get the following string:
    <!doctype html>
    <html>
    <head>
    <meta charset="ascii">
    </head>
    <body>
    <div title="2 > 1, 1 > 0, &agrave; vs &egrave;">
    3 > 2
    </div>
    </body>
    </html>
which differs from the file content in the title attribute value (&gt; gets transformed into > in the string "2 > 1"), while the rest of the document is OK.
Then, inspecting the attribute value with
    doc.body().select("div").forEach { div -> println("title = ${div.attr("title")}") }
produces the following string:
title = 2 > 1, 1 > 0, à vs è
Notice that &agrave; and &egrave; are transformed into à and è.
My question is: in Jsoup, how can I get the attribute values of HTML tags while preserving the way they are written in the input file?
In the example above I need the string "2 &gt; 1, 1 > 0, &agrave; vs &egrave;" (as it is written in the input file), not "2 > 1, 1 > 0, &agrave; vs &egrave;", nor "2 > 1, 1 > 0, à vs è".
The attr() method returns the string without HTML entities, and I did not find a way to keep the HTML entities. However, you can use the Jsoup.clean() method to convert the characters in the string back to entities.
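    import java.io.File
    import java.nio.charset.Charset
    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import org.jsoup.safety.Whitelist // named Safelist in jsoup 1.14.2 and later

    val charset = Charset.forName("ascii")
    val doc = Jsoup.parse(File("test.html").readText(charset))
    doc.body().select("div").forEach { div ->
        // Re-escape the decoded attribute value for the ASCII output charset.
        val title = Jsoup.clean(div.attr("title"), "", Whitelist.none(), Document.OutputSettings().charset(charset))
        println("title = $title")
    }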
The result is:
    title = 2 &gt; 1, &agrave; vs &egrave;
Of course, this might not be a solution for your use case.
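As a possible alternative to Jsoup.clean(), the re-escaping step could be done with Entities.escape(), assuming the jsoup version in use exposes the escape(String, Document.OutputSettings) overload. The helper name escapeFor below is hypothetical and not part of jsoup. This is only a sketch: it re-encodes the already decoded value, so it cannot tell a > that was written raw from one that was written as &gt; in the input file.

```kotlin
import java.nio.charset.Charset
import org.jsoup.nodes.Document
import org.jsoup.nodes.Entities

// Hypothetical helper (not part of jsoup): re-escapes an already decoded
// attribute value so that it only uses characters the charset can encode.
fun escapeFor(value: String, charset: Charset): String {
    val settings = Document.OutputSettings().charset(charset)
    return Entities.escape(value, settings)
}

fun main() {
    // The input here is the decoded string that attr("title") returned above.
    val escaped = escapeFor("2 > 1, 1 > 0, à vs è", Charsets.US_ASCII)
    println("title = $escaped")
}
```

Because the escaping is applied uniformly, every character that needs it is turned into an entity, whether or not the original file used an entity at that position.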