Jsoup Character Encoding Issue

The charset attribute is missing from the Content-Type header of the HTTP response, so Jsoup has to fall back to a default charset when parsing the HTML (UTF-8, going by the DataUtil source quoted further down). Setting Document.OutputSettings#charset() won't help: it only affects presentation (the output of html() and text()), not how the incoming data is parsed. In other words, by then it's too late.
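
To see why the parse-time charset matters, here is a small JDK-only sketch (it does not use Jsoup, and the class name is made up for illustration). It decodes the ISO-8859-1 bytes of a word with a non-ASCII letter twice: once with the right charset and once as UTF-8. In the second case the byte for "ü" is not valid UTF-8 and is replaced by U+FFFD, so no later output setting can bring it back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    // Decodes raw bytes with the given charset; malformed input becomes U+FFFD.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Creüsa" as a server would send it in ISO-8859-1 (ü is the single byte 0xFC)
        byte[] latin1 = "Cre\u00FCsa".getBytes(StandardCharsets.ISO_8859_1);

        // Correct charset: the original text comes back
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1));

        // Wrong charset: 0xFC is not valid UTF-8, so the letter is lost
        System.out.println(decode(latin1, StandardCharsets.UTF_8));
    }
}
```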

You need to read the URL as an InputStream and specify the charset manually in the Jsoup#parse() method.

String url = "http://www.latijnengrieks.com/vertaling.php?id=5368";
Document document = Jsoup.parse(new URL(url).openStream(), "ISO-8859-1", url);
Element paragraph = document.select("div.kader p").first();

for (Node node : paragraph.childNodes()) {
    if (node instanceof TextNode) {
        System.out.println(((TextNode) node).text().trim());
    }
}

This results in:

Aeneas dwaalt rond in Troje en zoekt Creüsa.
Creüsa is echter op de vlucht gestorven
Plotseling verschijnt er een schim.
Het is de schim van Creüsa.
De schim zegt:'De oorlog woedt!'
Troje is ingenomen!
Creüsa is gestorven:'Vlucht!'
Aeneas vlucht echter niet.
Dan spreekt de schim:'Vlucht! Er staat jou een nieuw vaderland en een nieuw koninkrijk te wachten.'
Dan pas gehoorzaamt Aeneas en vlucht.
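
If some responses do declare a charset and others do not, the choice can be made per response. The following is a minimal sketch using only the JDK: charsetFromContentType is a hypothetical helper (not part of Jsoup) that reads the charset parameter from a Content-Type header value and returns a caller-supplied fallback when it is missing or unsupported. The fallback ISO-8859-1 is an assumption taken from the example above.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeCharset {
    // Matches the charset parameter, e.g. "text/html; charset=ISO-8859-1"
    private static final Pattern CHARSET =
            Pattern.compile("(?i)\\bcharset\\s*=\\s*[\"']?([^\\s;\"']+)");

    // Hypothetical helper: returns the declared charset name, or the fallback when
    // the header is absent, has no charset parameter, or names an unusable charset.
    static String charsetFromContentType(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        String name = m.group(1);
        try {
            return Charset.isSupported(name) ? name : fallback;
        } catch (IllegalArgumentException e) { // syntactically illegal charset name
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html", "ISO-8859-1"));
        System.out.println(charsetFromContentType("text/html; charset=UTF-8", "ISO-8859-1"));
    }
}
```

The header value itself can be obtained from URLConnection#getContentType(), and the returned name can then be passed as the charset argument of Jsoup.parse(InputStream, String, String).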

jsoup and character encoding

The docs are out of date / incomplete. Jsoup does use the charset meta tag, as well as the http-equiv tag, to detect the charset. From the source, Jsoup.parse(File, String) looks like this:

public static Document parse(File in, String charsetName) throws IOException {
    return DataUtil.load(in, charsetName, in.getAbsolutePath());
}

DataUtil.load in turn calls parseByteData(...), which looks like this:

// reads bytes first into a buffer, then decodes with the appropriate charset. done this way to support
// switching the charset midstream when a meta http-equiv tag defines the charset.
// todo - this is getting gnarly. needs a rewrite.
static Document parseByteData(ByteBuffer byteData, String charsetName, String baseUri, Parser parser) {
    String docData;
    Document doc = null;

    if (charsetName == null) { // determine from meta. safe parse as UTF-8
        // look for <meta http-equiv="Content-Type" content="text/html;charset=gb2312"> or HTML5 <meta charset="gb2312">
        docData = Charset.forName(defaultCharset).decode(byteData).toString();
        doc = parser.parseInput(docData, baseUri);
        Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();
        if (meta != null) { // if not found, will keep utf-8 as best attempt
            String foundCharset = null;
            if (meta.hasAttr("http-equiv")) {
                foundCharset = getCharsetFromContentType(meta.attr("content"));
            }
            if (foundCharset == null && meta.hasAttr("charset")) {
                try {
                    if (Charset.isSupported(meta.attr("charset"))) {
                        foundCharset = meta.attr("charset");
                    }
                } catch (IllegalCharsetNameException e) {
                    foundCharset = null;
                }
            }

            (Snip...)

The following line from the above code snippet shows that it does use either meta[http-equiv=content-type] or meta[charset] to detect the encoding, and otherwise falls back to UTF-8.

Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();
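
The two-pass idea can be illustrated without Jsoup. The sketch below is a simplification under stated assumptions, not Jsoup's implementation: it decodes tentatively as UTF-8, searches the text with a regular expression (Jsoup parses the document and uses the selector shown above instead), and decodes again if a supported, different charset is declared. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSniffer {
    // Rough match for <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?([^\\s;\"'/>]+)");

    static String decode(byte[] data) {
        // First pass: tentative decode as UTF-8 (ASCII markup survives this intact)
        String firstPass = new String(data, StandardCharsets.UTF_8);

        Matcher m = META_CHARSET.matcher(firstPass);
        if (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    Charset declared = Charset.forName(name);
                    if (!declared.equals(StandardCharsets.UTF_8)) {
                        // Second pass: decode the same bytes with the declared charset
                        return new String(data, declared);
                    }
                }
            } catch (IllegalArgumentException e) {
                // illegal charset name: keep the UTF-8 result as the best attempt
            }
        }
        return firstPass;
    }
}
```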

I'm not quite sure what you mean here, but no, the output charset setting controls what characters are escaped when the document HTML / XML is printed to string, whereas the input charset determines how the file is read.
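
The output side can be illustrated without Jsoup as well. This is a rough sketch, not Jsoup's code: escapeFor is a hypothetical helper that writes every character the chosen output charset cannot encode as a numeric character reference, which is the kind of decision an output charset setting governs. How the input bytes were decoded plays no part in it.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class OutputEscapeDemo {
    // Hypothetical helper: keeps encodable characters, escapes the rest as &#x...;
    static String escapeFor(String text, Charset outputCharset) {
        CharsetEncoder encoder = outputCharset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#x").append(Integer.toHexString(codePoint)).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // ASCII cannot represent ü, so it is escaped; UTF-8 can, so it is kept
        System.out.println(escapeFor("Cre\u00FCsa", StandardCharsets.US_ASCII));
        System.out.println(escapeFor("Cre\u00FCsa", StandardCharsets.UTF_8));
    }
}
```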

It will only ever remove meta[name=charset] elements. From the source, this is the method that updates or removes the charset definition in the document:

private void ensureMetaCharsetElement() {
    if (updateMetaCharset) {
        OutputSettings.Syntax syntax = outputSettings().syntax();

        if (syntax == OutputSettings.Syntax.html) {
            Element metaCharset = select("meta[charset]").first();

            if (metaCharset != null) {
                metaCharset.attr("charset", charset().displayName());
            } else {
                Element head = head();

                if (head != null) {
                    head.appendElement("meta").attr("charset", charset().displayName());
                }
            }

            // Remove obsolete elements
            select("meta[name=charset]").remove();
        } else if (syntax == OutputSettings.Syntax.xml) {
        (Snip..)

Essentially, if you call charset(...) and the document does not have a charset meta tag, it will add one; otherwise it updates the existing one. It does not touch the http-equiv tag.

If you want to find out whether the document specifies an encoding, look for a meta http-equiv Content-Type tag or a meta charset tag. If neither is present, the document does not specify an encoding.

Jsoup is open source, so you can read the source yourself to see exactly how it works: https://github.com/jhy/jsoup/ (You can also modify it to do exactly what you want!)

I'll update this answer with further details when I have time. Let me know if you have any other questions.

Android/ Jsoup: how to fix encoding issues

I will get to character sets in Portuguese, Spanish (and Chinese) in a moment. First, though, note that the page you are trying to read actually loads its contents using AJAX / JavaScript. I can download AJAX-loaded content using my own library (available on the Internet); otherwise, a tool such as Selenium, Puppeteer, or Splash would be necessary. Character sets aside, how are you downloading the contents of your "Brazilian Constitution" as HTML in the first place? When I try a straight HTML downloader (no script execution), I get a pile of JavaScript without any Portuguese at all, and it looks nothing like the HTML posted in your question. :)

If you are already downloading the HTML and only have a problem with the character set, read on. If you have been unable to download anything but the AJAX / JavaScript calls, I can post a separate answer that explains executing JS / AJAX in one or two lines. (Essentially, what you posted isn't the same output that I'm getting.)

In the vast majority of cases, if the text is not plain ASCII (because it has non-English characters), it can be read as UTF-8. I translate Spanish and Chinese news articles, and UTF-8 always works for me. I had one Spanish site that expected an encoding called "iso8859-1" (ISO-8859-1), but other than the "Don Quijote de La Mancha" site where I found it, UTF-8 works.

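To tell you the truth, it has never been an issue for me when reading a web page (as opposed to writing one): an InputStreamReader created without an explicit charset decodes with the JVM's default charset, which is UTF-8 on my UNIX machines (and on Android), so the text is read as UTF-8 without any configuration. On a JVM whose default charset is different, pass the charset explicitly. Here is the "Open Connection" method body from a library I have written: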

HttpURLConnection con =                     (HttpURLConnection) url.openConnection();
con.setRequestMethod                        ("GET");
if (USE_USER_AGENT) con.setRequestProperty  ("User-Agent", USER_AGENT);
return new BufferedReader                   (new InputStreamReader(con.getInputStream()));
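
Because new InputStreamReader(InputStream) with no charset argument uses the JVM's default charset, the result can differ between machines. A hedged variant that pins the charset is sketched below. It assumes the caller knows the page's encoding, and openReader and firstLine are hypothetical names, not part of the library above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetReader {
    // Wraps a stream in a reader that decodes with the given charset,
    // independent of the JVM default.
    static BufferedReader openReader(InputStream in, Charset charset) {
        return new BufferedReader(new InputStreamReader(in, charset));
    }

    // Small demo helper: decodes the bytes and returns the first line.
    static String firstLine(byte[] bytes, Charset charset) {
        try (BufferedReader br = openReader(new ByteArrayInputStream(bytes), charset)) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "Constitui\u00E7\u00E3o".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(utf8, StandardCharsets.UTF_8));
    }
}
```

With an HTTP connection, the same helper would be called as openReader(con.getInputStream(), StandardCharsets.UTF_8).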

Here is the method body of a "Scrape Contents" method from my library:

URL url = new URL("http://www.planalto.gov.br/ccivil_03/constituicao/constituicao.htm");
StringBuilder sb = new StringBuilder();
String s;
BufferedReader br = Scrape.openConn(url);
while ((s = br.readLine()) != null) sb.append(s + "\n");
FileRW.writeFile(sb.toString(), "page.html");

I don't know the first thing about Microsoft character sets, to be fully honest with you. I have coded on UNIX, and I have never worried about character sets, other than to make sure that when writing HTML (as opposed to reading HTML), an HTML <META CHARSET="utf-8"> element is inserted into my pages.

JSoup encoding issue with numeric character references

Try calling the following before text():

document.outputSettings().charset("windows-1252");

For more output settings see the javadoc.

jsoup output encoding issue

Try setting doc.outputSettings().escapeMode(EscapeMode.xhtml) or changing the output charset before printing.

See also the (paltry) documentation for EscapeMode.