xml - How to fix encoding errors programmatically in XSLT
I am trying to batch process thousands of XML files from the command line, and I am getting various error messages relating to invalid characters.
So far, I have been able to fix this in two different ways:
- opening the offending file in Notepad and going to Save > UTF-8
- adding an encoding to the XML declaration (for some reason ISO-8859-1 works)
I am puzzled as to why I am getting these error messages. I can see no mention of an encoding in the original XML or in the DTD, so the XML is not claiming to be something it's not.
Given the number of files to be processed, I am finding it laborious to fix each file individually. I was wondering if there is a way to fix this programmatically, for example in the XSLT stylesheet?
The error message is:
Error on line 80 column 128 of 12345.dxl: SXXP0003: Error reported by XML parser: Invalid byte 1 of 1-byte UTF-8 sequence.
Column 128 of line 80 seems to correspond to a missing single curly quote: ("this governments local services realignment exercise").
I tried adding a character map to the XSLT, but I still get the same error:
<xsl:output method="text" omit-xml-declaration="yes" indent="no" use-character-maps="curly_quotes"/>
<xsl:character-map name="curly_quotes">
  <xsl:output-character character="’" string="‘"/>
  <xsl:output-character character="“" string="’"/>
  <xsl:output-character character="”" string="“"/>
  <xsl:output-character character="–" string="”"/>
</xsl:character-map>
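To answer the question as posed: in general (but see below), one cannot fix encoding errors programmatically in XSLT. XSLT acts upon a parsed XML document, and encoding errors typically prevent the document from being parsed correctly. Strictly speaking, that means there is no XML document present, just a stream of octets that fails to achieve XML well-formedness. A character map cannot help either, because it only affects how the output is serialized, and the failure happens while the input is being read.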
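To see the point outside XSLT, here is a minimal sketch in Python. It is not the asker's toolchain; the sample bytes are made up for illustration, and it assumes the offending byte is 0x92 (the right single curly quote in Windows-1252), which the question does not confirm.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: byte 0x92 in a document with no encoding
# declaration, so the parser assumes UTF-8.
undeclared = b"<p>this government\x92s exercise</p>"


def try_parse(data):
    """Return the root element's text, or None if the bytes do not parse."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


# 0x92 on its own is not a valid UTF-8 sequence, so parsing fails
# before any stylesheet could be applied.
failed = try_parse(undeclared)

# Declaring ISO-8859-1 makes every byte decodable, so the error goes
# away, but 0x92 is read as the control character U+0092, not a quote.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + undeclared
parsed = try_parse(declared)

# Re-encoding from the assumed real encoding (cp1252) to UTF-8
# yields the intended curly quote, U+2019.
fixed = try_parse(undeclared.decode("cp1252").encode("utf-8"))
```

If the assumption about Windows-1252 holds, this would also explain why the ISO-8859-1 declaration "works": it silences the parser without actually restoring the curly quote.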
As @nwellnhof points out, the tool to use is a character-set converter like iconv.
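For a batch of thousands of files, the same kind of conversion can be sketched with Python's standard library instead of calling iconv once per file. This is only a sketch under assumptions: the source encoding is guessed to be Windows-1252 (check a few files first), the `*.dxl` pattern is taken from the file name in the error message, the files carry no encoding declaration (as the question says), and the function names `reencode_file` and `reencode_tree` are made up here.

```python
from pathlib import Path


def reencode_file(path, source_encoding="cp1252"):
    """Rewrite one file as UTF-8; return True if the file was changed."""
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, leave it untouched
    except UnicodeDecodeError:
        pass
    # Assumed source encoding. decode() raises UnicodeDecodeError if the
    # file contains a byte that the source encoding does not define.
    text = raw.decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
    return True


def reencode_tree(root, pattern="*.dxl", source_encoding="cp1252"):
    """Convert every matching file under root; return the rewritten paths."""
    paths = sorted(Path(root).rglob(pattern))
    return [p for p in paths if reencode_file(p, source_encoding)]
```

The files are rewritten in place, so it is worth running this on a copy of the data first. Files that are already valid UTF-8 are skipped, which makes the script safe to run more than once.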
Note that while in the general case documents with encoding errors or inaccurate encoding declarations won't make it through the XML parsing phase, there can be exceptions: not all errors in the encoding declaration are reliably detectable. If, for example, one had a batch of documents labeled as being in ISO 8859-1, although in fact they were in ISO 8859-15 (or, I think, pretty much any other part of ISO 8859), it's unlikely that an XML parser would detect the error; an XSLT stylesheet that performs a near-identity transform and writes the input out with the desired encoding declaration could fix such an error. But that is a special case. Further discussion (for those who enjoy this sort of question) may be found at http://cmsmcq.com/2007/dialog.surrogates.xml
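A small Python sketch of why that mislabeling goes unnoticed (the sample string is invented for illustration): byte 0xA4 is the euro sign in ISO 8859-15 and the generic currency sign in ISO 8859-1, and both encodings assign a character to every byte value, so decoding with the wrong label never raises an error.

```python
# Hypothetical content actually written in ISO 8859-15.
raw = "price: 10 \u20ac".encode("iso-8859-15")

# Decoding with the wrong label succeeds silently; the euro sign
# simply turns into the currency sign U+00A4.
mislabeled = raw.decode("iso-8859-1")

# Because nothing was lost, the damage is a plain character
# substitution and can be undone by a round trip through the bytes.
repaired = mislabeled.encode("iso-8859-1").decode("iso-8859-15")
```

This is the situation in which a character-level substitution applied after parsing can plausibly repair the text, unlike the invalid UTF-8 case in the question, where parsing never succeeds.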