How to Fix Invalid Byte 1 of 1-Byte UTF-8 Sequence

How to fix this issue?

Read the data using the correct character encoding. The error message means that the parser is decoding the data as UTF-8, either because you asked it to or because UTF-8 is the default for an XML document whose declaration does not name another encoding (for example <?xml version="1.0" encoding="ISO-8859-1"?>), but the bytes are actually in a different encoding such as ISO-8859-1 or Windows-1252.
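As a rough sketch of what "reading with the correct encoding" can look like in Java, assuming you already know the real encoding of the bytes: decode them yourself through a Reader and hand the parser characters rather than bytes. The names ExplicitEncodingParse and rootText are made up for illustration, and ISO-8859-1 is only an example charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class ExplicitEncodingParse {
    // Hypothetical helper: decodes xmlBytes with the charset the caller knows
    // to be correct, parses the result, and returns the root element's text.
    static String rootText(byte[] xmlBytes, Charset actualCharset) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(xmlBytes), actualCharset)) {
            // An InputSource built from a Reader makes the parser consume
            // characters, so it does not try to decode the bytes itself.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Example input: an XML snippet stored as ISO-8859-1 bytes.
        byte[] latin1 = "<name>\u00C4rger</name>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

If the document carries a correct encoding declaration, passing the raw InputStream to the parser is usually enough; the Reader approach is for cases where the declaration is missing or wrong.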

To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.

I have UTF-8 - but still get Invalid byte 1 of 1-byte UTF-8 sequence

If your database contains only a single byte (with value 0xC4) then you aren't using UTF-8 encoding.

The character "LATIN CAPITAL LETTER A WITH DIAERESIS" has the code-point value U+00C4, but UTF-8 can't encode that in a single byte. If you check the third column "UTF-8 (hex.)" on UTF8-zeichentabelle.de you'll see that UTF-8 encodes it as the two bytes 0xC3 0x84.
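You can check this yourself with a few lines of Java. This is just a small sketch; the class name UmlautBytes and the hex helper are invented for the example.

```java
import java.nio.charset.StandardCharsets;

final class UmlautBytes {
    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00C4"; // LATIN CAPITAL LETTER A WITH DIAERESIS
        System.out.println(hex(aUmlaut.getBytes(StandardCharsets.UTF_8)));      // C3 84
        System.out.println(hex(aUmlaut.getBytes(StandardCharsets.ISO_8859_1))); // C4
    }
}
```

A lone 0xC4 is therefore what you get from a single-byte encoding such as ISO-8859-1, not from UTF-8.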

Please read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for more info.

EDIT: Christian found the answer himself; it turned out to be a problem in the Cocoon 3 SAX component (I guess it's the alpha 3 version). If you pass an XML document as a String into the XMLGenerator class, something goes wrong during SAX parsing, causing this mess.

I looked up the code to find the actual problem in Cocoon-stax:

if (XMLGenerator.this.logger.isDebugEnabled()) {
    XMLGenerator.this.logger.debug("Using a string to produce SAX events.");
}
XMLUtils.toSax(new ByteArrayInputStream(this.xmlString.getBytes()), XMLGenerator.this.getSAXConsumer());

As you can see, getBytes() is called without a charset, so it builds the byte array using the JRE's default encoding, and parsing then fails. The XML declares itself to be UTF-8, but the bytes handed to the parser are encoded in the platform default, most likely your Windows code page.
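To see why that mismatch breaks parsing, here is a small sketch that strictly decodes the same XML string as UTF-8 after encoding it two different ways. ISO-8859-1 is used only as a stand-in for a Windows default code page, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetPitfall {
    // Returns true only if the bytes are well-formed UTF-8. A fresh decoder
    // reports malformed input instead of silently replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><msg>\u00C4rger</msg>";
        // Assumption: stands in for whatever single-byte default the platform uses.
        Charset legacy = StandardCharsets.ISO_8859_1;
        System.out.println(isValidUtf8(xml.getBytes(legacy)));                 // false
        System.out.println(isValidUtf8(xml.getBytes(StandardCharsets.UTF_8))); // true
    }
}
```

The first byte array contains a bare 0xC4 followed by an ASCII letter, which is exactly the kind of input that produces "Invalid byte 1 of 1-byte UTF-8 sequence".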

As a workaround, one can use the following:

new org.apache.cocoon.sax.component.XMLGenerator(xmlInput.getBytes("UTF-8"),
       "UTF-8");

This will trigger the right internal actions (as Christian found out by experimenting with the API).

I've opened an issue in Apache's bug tracker.

EDIT 2: The issue is fixed and will be included in an upcoming release.

Message: Invalid byte 1 of 1-byte UTF-8 sequence in hadoop

I suspect this is the problem - it's at least a problem:

XMLStreamReader reader =
    XMLInputFactory.newInstance().createXMLStreamReader(new
        ByteArrayInputStream(document.getBytes()));

That call to getBytes will use the platform default encoding, rather than UTF-8.

You could specify "utf-8" as the encoding name - but it would be simpler to create a StringReader:

XMLStreamReader reader = XMLInputFactory.newInstance()
    .createXMLStreamReader(new StringReader(document));

Of course that may not be the only error, but it's at least something to look at.
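For reference, both options can be tried in isolation with something like the following. It is only a sketch: the class and method names are made up, and it assumes the XML is already held in a String called document, as in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxInputSketch {
    // Advances to the first start tag and returns that element's text,
    // or null if the document has no elements.
    static String firstElementText(XMLStreamReader reader) throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                return reader.getElementText();
            }
        }
        return null;
    }

    // Option 1: no bytes involved, so no charset can be guessed wrongly.
    static String parseFromString(String document) {
        try {
            return firstElementText(XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(document)));
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Option 2: keep the byte stream, but encode explicitly as UTF-8.
    static String parseFromUtf8Bytes(String document) {
        try {
            return firstElementText(XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(
                            document.getBytes(StandardCharsets.UTF_8))));
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String document = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><msg>\u00C4rger</msg>";
        System.out.println(parseFromString(document));
        System.out.println(parseFromUtf8Bytes(document));
    }
}
```

The StringReader variant is the simpler of the two because the parser never has to decode bytes at all.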

Invalid byte 1 of 1-byte UTF-8 sequence occurs when posting XML from a .jar but not in Eclipse

You need to choose the encoding used by your PrintWriter. Outside of Eclipse, your platform is presumably defaulting to something other than UTF-8.

Try this code:

PrintWriter pw = new PrintWriter(new OutputStreamWriter(
    conn.getOutputStream(), "UTF-8"));
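To convince yourself that the writer now produces UTF-8 regardless of the platform default, you can point the same construction at an in-memory stream. This is a sketch only; Utf8WriterSketch, writeXml and encode are hypothetical names, and the ByteArrayOutputStream stands in for conn.getOutputStream().

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

final class Utf8WriterSketch {
    // Writes the XML text to the stream as UTF-8, independent of the
    // JVM's default charset.
    static void writeXml(OutputStream out, String xml) {
        PrintWriter pw = new PrintWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8));
        pw.print(xml);
        pw.flush(); // push buffered characters through to the stream
    }

    // Convenience wrapper for inspection: returns the bytes that would be sent.
    static byte[] encode(String xml) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeXml(buffer, xml);
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encode("\u00C4")) {
            System.out.printf("%02X ", b & 0xFF); // C3 84
        }
        System.out.println();
    }
}
```

Using the StandardCharsets.UTF_8 constant instead of the string "UTF-8" also avoids the checked UnsupportedEncodingException.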
