How to Transform UTF-8 Characters to ISO-8859-1

Convert UTF-8 characters to ISO-8859-1 and back in PHP

Have a look at iconv() or mb_convert_encoding().
Just by the way: why don't utf8_encode() and utf8_decode() work for you?

utf8_decode — Converts a string with
ISO-8859-1 characters encoded with
UTF-8 to single-byte ISO-8859-1

utf8_encode — Encodes an ISO-8859-1
string to UTF-8

So essentially

$utf8 = 'ÄÖÜ'; // file must be UTF-8 encoded
$iso88591_1 = utf8_decode($utf8);
$iso88591_2 = iconv('UTF-8', 'ISO-8859-1', $utf8);
$iso88591_3 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

$iso88591 = 'ÄÖÜ'; // file must be ISO-8859-1 encoded
$utf8_1 = utf8_encode($iso88591);
$utf8_2 = iconv('ISO-8859-1', 'UTF-8', $iso88591);
$utf8_3 = mb_convert_encoding($iso88591, 'UTF-8', 'ISO-8859-1');

All three should produce the same result. utf8_encode() and utf8_decode() require no special extension (note that both are deprecated as of PHP 8.2), mb_convert_encoding() requires ext/mbstring, and iconv() requires ext/iconv.

Java: how to undo conversion from UTF-8 to ISO-8859-1

Suppose we have a string containing "double ISO-8859-1" characters (UTF-8 bytes that were mistakenly decoded as ISO-8859-1), such as Ã©.

To convert such characters back to the intended text, we can use this constructor of String: pass a byte array and a Charset object. The class java.nio.charset.StandardCharsets provides constants for various Charset objects.

String accentE =
    new String(
        "Ã©".getBytes(StandardCharsets.ISO_8859_1),
        StandardCharsets.UTF_8
    );

which is é

See this code run live at IdeOne.com.

How do I convert between ISO-8859-1 and UTF-8 in Java?

In general, you can't do this losslessly. UTF-8 is capable of encoding any Unicode code point. ISO-8859-1 can handle only a tiny fraction of them. So, transcoding from ISO-8859-1 to UTF-8 is no problem. Going backwards from UTF-8 to ISO-8859-1 will cause replacement characters (typically "?") to appear in your text wherever unsupported characters are found.

To transcode text:

byte[] latin1 = ...
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");

or

byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");

You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.
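A minimal sketch of that lower-level approach, assuming Java 7 or later for StandardCharsets. The class and method names (StrictLatin1, encodeStrict, encodeReplacing) are made up for illustration; CharsetEncoder and CodingErrorAction are the standard java.nio.charset classes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class StrictLatin1 {

    // Encodes text as ISO-8859-1 and throws instead of silently substituting.
    static byte[] encodeStrict(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Encodes text as ISO-8859-1, writing the given byte for each unencodable character.
    static byte[] encodeReplacing(String text, byte replacement) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { replacement });
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Copies the readable part of the buffer into a fresh array.
    private static byte[] toArray(ByteBuffer buffer) {
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }

    public static void main(String[] args) throws CharacterCodingException {
        // U+044F (Cyrillic ya) has no ISO-8859-1 equivalent.
        byte[] replaced = encodeReplacing("a\u044Fb", (byte) '_');
        System.out.println(new String(replaced, StandardCharsets.ISO_8859_1));
        try {
            encodeStrict("a\u044Fb");
        } catch (CharacterCodingException e) {
            System.out.println("not encodable: " + e);
        }
    }
}
```

Whether to report or replace depends on the caller: reporting suits data that must round-trip, replacing suits display-only output.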

How can I transform the UTF-8 chars to ISO-8859-1

~ UPDATE ~

ruby-iconv has been superseded from Ruby 1.9.3 onwards by the encode method. See
Jörg W Mittag's answer for details, but in short:

utf8string = "èàòppè"
iso_string = utf8string.encode('ISO-8859-1')

I agree with Williham Totlandt in thinking that this type of conversion might not be the smartest idea ever, but anyway: use ruby-iconv :)

utf8string = "èàòppè"
iso_string = Iconv.conv 'iso8859-1', 'UTF-8', utf8string

Convert character from UTF-8 to ISO-8859-1 manually

The pages you are using are confusing you somewhat. Neither your "UTF-8 table" nor your "Unicode table" is giving you the value of the code point in UTF-8. They are both simply listing the Unicode value of the characters.

In Unicode, every character ("code point") has a unique number assigned to it. The character ö is assigned the code point U+00F6, which is F6 in hexadecimal, and 246 in decimal.

UTF-8 is a representation of Unicode, using a sequence of between one and four bytes per Unicode code point. The transformation from Unicode code points to UTF-8 byte sequences is described in that article; it is pretty simple to do once you get used to it. Of course, computers do it all the time, but you can do it with a pencil and paper easily, and in your head with a bit of practice.

If you do that transformation, you will see that U+00F6 transforms to the UTF-8 sequence C3 B6, or 1100 0011 1011 0110 in binary, which is why that is the UTF-8 representation of ö.
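The pencil-and-paper transformation can be sketched in Java as follows. This is an illustrative sketch only; the class Utf8ByHand and its methods are hypothetical names, and in real code you would let String.getBytes do this for you.

```java
// Hypothetical helper class, not part of any library.
final class Utf8ByHand {

    // Encodes one Unicode code point into its UTF-8 byte sequence.
    static byte[] encodeCodePoint(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        if (cp < 0x80) {        // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {       // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {     // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeCodePoint(0x00F6))); // prints "C3 B6"
    }
}
```

For U+00F6 the top five bits (00011) go into the lead byte 110xxxxx and the low six bits (110110) into the continuation byte 10xxxxxx, giving C3 B6.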

The other half of your question is about ISO-8859-1. This is a character encoding commonly called "Latin-1". The numeric values of the Latin-1 encoding are the same as the first 256 code points in Unicode, thus ö is F6 in Latin-1.

Once you have converted between UTF-8 and standard Unicode code points (UTF-32), it should be trivial to get the Latin-1 encoding. However, not all UTF-8 sequences / Unicode characters have corresponding Latin-1 characters.
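As a rough illustration of doing the whole conversion by hand, here is a Java sketch that turns UTF-8 bytes into Latin-1 bytes. It relies on the fact that only the lead bytes C2 and C3 produce code points in the range U+0080 to U+00FF; anything else has no Latin-1 equivalent and is rejected. The class name Utf8ToLatin1ByHand and the choice to throw IllegalArgumentException are assumptions made for this example.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper class, not part of any library.
final class Utf8ToLatin1ByHand {

    // Converts UTF-8 bytes to ISO-8859-1 bytes, rejecting anything outside Latin-1.
    static byte[] convert(byte[] utf8) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < utf8.length) {
            int b0 = utf8[i] & 0xFF;
            if (b0 < 0x80) {
                // ASCII: identical in UTF-8, Unicode and Latin-1.
                out.write(b0);
                i += 1;
            } else if ((b0 == 0xC2 || b0 == 0xC3)
                    && i + 1 < utf8.length
                    && (utf8[i + 1] & 0xC0) == 0x80) {
                // Two-byte sequence: 110xxxxx 10xxxxxx -> code point U+0080..U+00FF.
                int codePoint = ((b0 & 0x1F) << 6) | (utf8[i + 1] & 0x3F);
                // The Latin-1 byte value equals the code point.
                out.write(codePoint);
                i += 2;
            } else {
                throw new IllegalArgumentException(
                        "malformed UTF-8 or no Latin-1 equivalent at byte " + i);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] latin1 = convert(new byte[] { (byte) 0xC3, (byte) 0xB6 });
        System.out.printf("%02X%n", latin1[0] & 0xFF); // prints "F6"
    }
}
```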

See the excellent article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for a better understanding of character encodings and transformations between them.

Converting UTF-8 to ISO-8859-1 in Java

I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.

The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using escape sequences as shown here:

public final class HtmlEncoder {
    private HtmlEncoder() {}

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
            T out) throws java.io.IOException {
        for (int i = 0; i < sequence.length(); i++) {
            char ch = sequence.charAt(i);
            if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append(ch);
            } else {
                int codepoint = Character.codePointAt(sequence, i);
                // handle supplementary range chars
                i += Character.charCount(codepoint) - 1;
                // emit entity
                out.append("&#x");
                out.append(Integer.toHexString(codepoint));
                out.append(";");
            }
        }
        return out;
    }
}

Example usage:

String foo = "This is Cyrillic Ya: \u044F\n"
    + "This is fraktur G: \uD835\uDD0A\n"
    + "This is a smart quote: \u201C";

StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());

Above, the character LEFT DOUBLE QUOTATION MARK (U+201C) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.

Care needs to be taken with this approach. If your text also needs to be escaped for HTML, do that before running the above code; otherwise the ampersands in the generated entities end up being escaped.

How to convert String with â€œ (ISO-8859-1) characters to normal (UTF-8) characters?

$final = '<li>Jain R.K. and Iyengar S.R.K., â€œAdvanced Engineering Mathematicsâ€, Narosa Publications,</li>';

$final = str_replace("Â", "", $final);
$final = str_replace("â€™", "'", $final);
$final = str_replace("â€œ", '"', $final);
$final = str_replace('â€“', '-', $final);
$final = str_replace('â€', '"', $final);

For existing data, I replaced the garbled sequences with valid UTF-8 characters.

For future data, I set the charset to UTF-8 in PHP, in the HTML, and on the database connections.

How do I convert special UTF-8 chars to their iso-8859-1 equivalent using javascript?

Actually, everything is typically stored as Unicode of some kind internally, but let's not go into that. I'm assuming you're getting the iconic "Ã¥Ã¤Ã¶" type strings because you're using ISO-8859-1 as your character encoding. There's a trick you can do to convert those characters. The escape and unescape functions used for encoding and decoding query strings are defined for ISO characters, whereas the newer encodeURIComponent and decodeURIComponent, which do the same thing, are defined for UTF-8 characters.

escape encodes extended ISO-8859-1 characters (Unicode code points U+0080-U+00FF) as %xx (two-digit hex), whereas it encodes code points U+0100 and above as %uxxxx (%u followed by four-digit hex). For example, escape("å") == "%E5" and escape("あ") == "%u3042".

encodeURIComponent percent-encodes extended characters as a UTF-8 byte sequence. For example, encodeURIComponent("å") == "%C3%A5" and encodeURIComponent("あ") == "%E3%81%82".

So you can do:

fixedstring = decodeURIComponent(escape(utfstring));

For example, an incorrectly encoded character "å" becomes "Ã¥". The call escape("Ã¥") returns "%C3%A5", which is the two incorrect ISO characters encoded as single bytes. Then decodeURIComponent("%C3%A5") returns "å", because the two percent-encoded bytes are interpreted as a UTF-8 sequence.

If you need to do the reverse for some reason, that works too:

utfstring = unescape(encodeURIComponent(originalstring));

Is there a way to differentiate between bad UTF-8 strings and ISO strings? Turns out there is. The decodeURIComponent function used above will throw an error if given a malformed encoded sequence. We can use this to detect with high probability whether our string is UTF-8 or ISO.

var fixedstring;

try {
    // If the string is UTF-8, this will work and not throw an error.
    fixedstring = decodeURIComponent(escape(badstring));
} catch (e) {
    // If it isn't, an error will be thrown, and we can assume that we have an ISO string.
    fixedstring = badstring;
}
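The same detect-and-fall-back idea can be sketched in Java (rather than JavaScript, to match the Java answers earlier on this page). This is only a sketch under assumptions: the class MojibakeFixer and its fix method are hypothetical names, and the heuristic shares the limitation described above, namely that it is probabilistic and not a proof of the original encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class MojibakeFixer {

    // Returns the repaired text if the input looks like UTF-8 misread as
    // ISO-8859-1; otherwise returns the input unchanged.
    static String fix(String suspect) {
        for (int i = 0; i < suspect.length(); i++) {
            if (suspect.charAt(i) > 0xFF) {
                // Not representable as single ISO-8859-1 bytes, so leave it alone.
                return suspect;
            }
        }
        // Recover the raw bytes that were misinterpreted as ISO-8859-1.
        byte[] raw = suspect.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // A strict decoder throws on malformed input instead of substituting.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // The bytes are not valid UTF-8, so assume a genuine ISO string.
            return suspect;
        }
    }

    public static void main(String[] args) {
        // "\u00C3\u00A5" is the two-character garbled form of U+00E5.
        System.out.println(fix("\u00C3\u00A5").equals("\u00E5")); // prints "true"
    }
}
```

Plain ASCII input passes through unchanged, since ASCII bytes mean the same thing in both encodings.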

