Encode String to UTF-8

String objects in Java always use the UTF-16 encoding, and that cannot be changed*.

The only thing that can have a different encoding is a byte[]. So if you need UTF-8 data, then you need a byte[]. If you have a String that contains unexpected data, then the problem is at some earlier place that incorrectly converted some binary data to a String (i.e. it was using the wrong encoding).

* As a matter of implementation, String can internally use an ISO-8859-1 encoded byte[] when all of its characters fit that range, but that is an implementation-specific optimization that isn't visible to users of String (i.e. you'll never notice unless you dig into the source code or use reflection to dig into a String object).

Encode a String to UTF-8 in Perl

You need utf8::encode, not utf8::decode. Both of them change the argument in place; utf8::encode returns nothing and utf8::decode returns only a success flag, so there's no point in assigning the return value to the variable.

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use JSON;

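# With "use utf8", this literal is a string of characters, not bytes.
my $data = qq({"cat":"Büster"});
# decode_json expects UTF-8 bytes, so encode the characters in place first.
utf8::encode($data);
$data = JSON::decode_json($data);
# Tell the output handle to encode characters as UTF-8 when printing.
binmode *STDOUT, ':encoding(UTF-8)';
print $data->{cat};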

Moreover, the output filehandle needs to know which encoding it should use; that's what the binmode call does.

Also, make sure you save the source in the UTF-8 encoding.

String encoding (UTF-8) JAVA

According to the javadoc of String#getBytes(String charsetName):

Encodes this String into a sequence of bytes using the named charset,
storing the result into a new byte array.

And according to the documentation of String(byte[] bytes, Charset charset):

Constructs a new String by decoding the specified array of bytes using
the specified charset.

Thus getBytes() is the opposite operation of String(byte[]). getBytes() encodes the string to bytes, and String(byte[]) decodes the byte array back into a string. You have to use the same charset for both methods to preserve the actual string value. That is why your second example is wrong:

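// This is wrong because you are calling getBytes() with the default charset
// but converting those bytes to a string using UTF-8. This will mostly work
// because the default encoding is usually UTF-8, but it can fail, so it is wrong.
new String(string1.getBytes(), "UTF-8");

// Correct: name the same charset on both sides
// (requires import java.nio.charset.StandardCharsets).
new String(string1.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);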
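
The round-trip rule is not specific to Java. As a rough illustration, here is a minimal sketch in Python (chosen only for brevity, it is not part of the original answer) of what happens when the decoding charset does not match the encoding charset. The sample text is an assumption.

```python
# Hypothetical sample text containing one non-ASCII character.
text = "Büster"

# Same charset on both sides: the original text survives the round trip.
round_trip = text.encode("utf-8").decode("utf-8")

# Different charsets: each UTF-8 byte of "ü" is read as a separate
# ISO-8859-1 character, which garbles the text.
mismatched = text.encode("utf-8").decode("iso-8859-1")

print(round_trip)
print(mismatched)
```

The mismatched case does not raise an error, which is why this kind of bug often goes unnoticed until non-ASCII data shows up.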

encoding string to utf-8 leaves non english characters as byte strings

print(tweet.content.encode('utf-8')) writes a byte string (data, not text) in a human-readable, ASCII-compatible form (leading b to represent a byte string, non-ASCII byte values >127 represented as hexadecimal escape codes \xNN) and is not what you want.

If using output redirection, Python can be told what encoding to use to convert text to a byte stream suitable for a file using an environment variable:

set PYTHONIOENCODING=utf8
python main.py > output.txt

You can also write the data directly to a file specifying the encoding instead of using redirection:

with open('tweet.txt', 'w', encoding='utf8') as f:
    f.write(tweet.content)
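
To make the difference between a byte string and text concrete, here is a minimal sketch. The variable content is a hypothetical stand-in for tweet.content, and the temporary file path is likewise just for illustration.

```python
import os
import tempfile

# Hypothetical stand-in for tweet.content from the question.
content = "café ñ"

# Encoding yields a bytes object; printing it shows the repr, with every
# non-ASCII byte written as a \xNN escape.
encoded = content.encode("utf-8")
shown = repr(encoded)

# Writing the text through a file opened with an explicit encoding stores
# the same UTF-8 bytes on disk, without any escaping.
path = os.path.join(tempfile.mkdtemp(), "tweet.txt")
with open(path, "w", encoding="utf8") as f:
    f.write(content)

with open(path, "rb") as f:
    on_disk = f.read()

print(shown)
print(on_disk == encoded)
```

The escapes only exist in the printed representation; the bytes in the file are plain UTF-8 that any editor set to UTF-8 will display as the original characters.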

Decoding string to UTF-8 if string is already defined as r'string' instead of b'string'

Try this:

import re
re.sub(rb'\\([0-7]{3})', lambda match: bytes([int(match[1], 8)]), stringa.encode('ascii')).decode('utf-8')

Explanation:

We use the regular expression rb'\\([0-7]{3})' (which matches a literal backslash \ followed by exactly 3 octal digits) and replace each occurrence by taking the three digit code (match[1]), interpreting that as a number written in octal (int(_, 8)), and then replacing the original escape sequence with a single byte (bytes([_])).

We need to operate on bytes because the escape codes represent raw bytes, not Unicode characters. Only after we have unescaped those sequences can we decode the UTF-8 bytes to a string.
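
As a worked example, the same expression can be wrapped in a small function and run on a sample value. The function name unescape_octal and the sample contents of stringa are assumptions made for illustration.

```python
import re

# Hypothetical input: a text string holding octal escapes for UTF-8 bytes,
# as it would look if it had been defined as r'...' instead of b'...'.
stringa = r"Citt\303\240 di Torino"


def unescape_octal(text):
    """Turn each backslash-plus-three-octal-digits escape into one byte,
    then decode the result as UTF-8."""
    raw = text.encode("ascii")
    # Note: escapes above \377 do not fit in one byte and would raise
    # ValueError in bytes([...]); this sketch does not handle them.
    unescaped = re.sub(
        rb"\\([0-7]{3})",
        lambda match: bytes([int(match[1], 8)]),
        raw,
    )
    return unescaped.decode("utf-8")


print(unescape_octal(stringa))
```

Here \303 and \240 are the octal forms of the bytes 0xC3 and 0xA0, which together are the UTF-8 encoding of à.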

Encode a string in UTF-8

Try this instead:

$enc = [System.Text.Encoding]::UTF8
$consumerkey = "xvz1evFS4wEEPTGEFPHBog"
$encconsumerkey = $enc.GetBytes($consumerkey)

Encode String to UTF-8 in Kotlin

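Kotlin has an overload of ByteArray.toString that accepts a Charset, so decoding bytes is just array.toString(charset). The opposite direction, encoding a String to bytes, is string.toByteArray(charset), which defaults to UTF-8.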

I cannot find a section in the documentation specifying that the no-argument ByteArray.toString() does the right thing. In Java, calling toString() on a byte[] returns an identity string (something like [B@1b6d3586) rather than the decoded text, and that behavior is probably preserved in Kotlin, so I would guess it does the wrong thing. I recommend using toString(charset) explicitly.

Preventing Python requests.post to encode strings to UTF-8

If for some reason the target API is not fully JSON-compliant, you can build the JSON request body manually and encode it in whatever encoding you like. ensure_ascii=False will disable the translation of non-ASCII characters to escape codes, and you can specify the encoding if it is non-standard. The Wireshark screenshot shows the data is actually UTF-8-encoded, so that is what I've done below:

import requests
import json

payload = {
    "type": "send-message",
    "username": "myuser",
    "password": "mypass",
    "to": "456",
    "msg": "here are accents: é ç"
}

headers = {'Content-Type': 'application/json'}
data = json.dumps(payload, ensure_ascii=False).encode('utf8')
resp = requests.post("http://192.168.1.10/send_message.html", data=data, headers=headers)
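
To see what ensure_ascii changes without sending anything over the network, here is a minimal sketch that only builds the body. The trimmed-down payload is an assumption for illustration.

```python
import json

# Trimmed-down, hypothetical version of the payload above.
payload = {"msg": "here are accents: é ç"}

# Default behaviour (ensure_ascii=True): non-ASCII characters become
# \uXXXX escapes, so the resulting JSON text is pure ASCII.
escaped = json.dumps(payload)

# With ensure_ascii=False the characters are kept as they are ...
literal = json.dumps(payload, ensure_ascii=False)

# ... and only become bytes when the text is explicitly encoded.
body = literal.encode("utf8")

print(escaped)
print(literal)
print(body)
```

Both forms are valid JSON and parse back to the same data; the difference only matters when the receiving side does not handle \uXXXX escapes.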

