How to Convert a UTF-8 String to a Byte Array

How to convert UTF8 string to byte array?

The logic of encoding Unicode in UTF-8 is basically:

  • Up to 4 bytes per character can be used. The fewest number of bytes possible is used.
  • Characters up to U+007F are encoded with a single byte.
  • For multibyte sequences, the number of leading 1 bits in the first byte gives the number of bytes for the character. The rest of the bits of the first byte can be used to encode bits of the character.
  • The continuation bytes begin with 10, and the other 6 bits encode bits of the character.
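
As a worked example of these rules, here is a minimal JavaScript sketch that encodes a single code point by hand. The helper name encodeCodePoint is hypothetical (it is not part of any library), and the sketch assumes its argument is a valid code point between 0 and 0x10FFFF outside the surrogate range; it does not check this.

```javascript
// Encode one Unicode code point as an array of UTF-8 bytes.
// encodeCodePoint is a hypothetical helper used only for illustration;
// it assumes a valid, non-surrogate code point and does no validation.
function encodeCodePoint(cp) {
  if (cp < 0x80) {
    return [cp];                                // 0xxxxxxx
  }
  if (cp < 0x800) {
    return [0xc0 | (cp >> 6),                   // 110xxxxx
            0x80 | (cp & 0x3f)];                // 10xxxxxx
  }
  if (cp < 0x10000) {
    return [0xe0 | (cp >> 12),                  // 1110xxxx
            0x80 | ((cp >> 6) & 0x3f),          // 10xxxxxx
            0x80 | (cp & 0x3f)];                // 10xxxxxx
  }
  return [0xf0 | (cp >> 18),                    // 11110xxx
          0x80 | ((cp >> 12) & 0x3f),           // 10xxxxxx
          0x80 | ((cp >> 6) & 0x3f),            // 10xxxxxx
          0x80 | (cp & 0x3f)];                  // 10xxxxxx
}

// U+20AC (euro sign) needs three bytes.
console.log(encodeCodePoint(0x20ac).map(b => b.toString(16)).join(" ")); // e2 82 ac
```

The first byte of the euro sign, 0xe2, is 1110 0010 in binary: three leading 1 bits announce a three-byte sequence, and the remaining four bits carry the top bits of the code point.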

Here's a function I wrote a while back for encoding a JavaScript UTF-16 string in UTF-8:

function toUTF8Array(str) {
    var utf8 = [];
    for (var i = 0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) {
            // 1 byte: 0xxxxxxx
            utf8.push(charcode);
        } else if (charcode < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            utf8.push(0xc0 | (charcode >> 6),
                      0x80 | (charcode & 0x3f));
        } else if (charcode < 0xd800 || charcode >= 0xe000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            utf8.push(0xe0 | (charcode >> 12),
                      0x80 | ((charcode >> 6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        } else {
            // surrogate pair (assumes well-formed UTF-16, i.e. a high
            // surrogate is always followed by a low surrogate)
            i++;
            // UTF-16 encodes 0x10000-0x10FFFF by
            // subtracting 0x10000 and splitting the
            // 20 bits of 0x0-0xFFFFF into two halves
            charcode = 0x10000 + (((charcode & 0x3ff) << 10)
                                  | (str.charCodeAt(i) & 0x3ff));
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            utf8.push(0xf0 | (charcode >> 18),
                      0x80 | ((charcode >> 12) & 0x3f),
                      0x80 | ((charcode >> 6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
    }
    return utf8;
}
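
In environments that provide the standard TextEncoder and TextDecoder classes (modern browsers and Node.js), a hand-written encoder is rarely needed. The sketch below is one way to get the UTF-8 bytes and to go back again; the sample string and variable names are only for illustration. Note that TextEncoder returns a Uint8Array rather than a plain array, and that it replaces unpaired surrogates with U+FFFD.

```javascript
// Sample text containing 1-, 2-, 3- and 4-byte characters:
// "h", e-acute (U+00E9), euro sign (U+20AC), and U+1F600 as a surrogate pair.
const sampleText = "h\u00e9llo \u20ac \ud83d\ude00";

// String -> UTF-8 bytes (Uint8Array).
const sampleBytes = new TextEncoder().encode(sampleText);

// UTF-8 bytes -> string.
const decodedText = new TextDecoder("utf-8").decode(sampleBytes);

console.log(sampleBytes.length);          // 15
console.log(decodedText === sampleText);  // true
```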

How to convert Strings to and from UTF8 byte arrays in Java

Convert from String to byte[]:

// requires: import java.nio.charset.StandardCharsets;
String s = "some text here";
byte[] b = s.getBytes(StandardCharsets.UTF_8);

Convert from byte[] to String:

byte[] b = {(byte) 99, (byte)97, (byte)116};
String s = new String(b, StandardCharsets.US_ASCII);

You should, of course, use the encoding that matches your data. These examples use US-ASCII and UTF-8, two commonly used encodings.

How to convert utf8 string to []byte?

This question is a possible duplicate of How to assign string to bytes array, but still answering it as there is a better, alternative solution:

Converting from string to []byte is allowed by the spec, using a simple conversion:

Conversions to and from a string type

[...]


  1. Converting a value of a string type to a slice of bytes type yields a slice whose successive elements are the bytes of the string.

So you can simply do:

s := "some text"
b := []byte(s) // b is of type []byte

However, the string => []byte conversion makes a copy of the string content (it has to, as strings are immutable while []byte values are not), which is inefficient for large strings. If the goal is to unmarshal JSON, you can instead create an io.Reader using strings.NewReader(), which reads from the passed string without copying it. You can then pass this io.Reader to json.NewDecoder() and unmarshal using the Decoder.Decode() method:

s := `{"somekey":"somevalue"}`

var result interface{}
err := json.NewDecoder(strings.NewReader(s)).Decode(&result)
fmt.Println(result, err)

Output:

map[somekey:somevalue] <nil>

Note: calling strings.NewReader() and json.NewDecoder() does have some overhead, so if you're working with small JSON texts, you can safely convert them to []byte and use json.Unmarshal(); it won't be slower:

s := `{"somekey":"somevalue"}`

var result interface{}
err := json.Unmarshal([]byte(s), &result)
fmt.Println(result, err)

The output is the same.

Note: if you're getting your JSON input string by reading some io.Reader (e.g. a file or a network connection), you can directly pass that io.Reader to json.NewDecoder(), without having to read the content from it first.

How to convert UTF-8 byte[] to string

string result = System.Text.Encoding.UTF8.GetString(byteArray);

How to convert utf8 string to utf8 byte array?

Another option is to use Encoding.UTF8.GetBytes:

string value = "\u00C4 \uD802\u0033 \u00AE";
byte[] bytes = System.Text.Encoding.UTF8.GetBytes(value);

For more information, see the documentation for the Encoding.UTF8 property.

How can I encode a string to UTF-8 into a pre-existing byte array?

GetBytes has another overload that writes to an existing array:

byte[] bytes = new byte[1000]; // sample, make sure it has enough space
var specificIndex = 0;
var actualByteCount = Encoding.UTF8.GetBytes(
    myString, 0, myString.Length, bytes, specificIndex);

Don't forget to use the return value (actualByteCount) to know how many bytes in the array actually represent the string.

Note that you may need to call GetByteCount to determine the required array size, or reduce the number of characters converted so that the result fits into your buffer.
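
For comparison, JavaScript has a similar facility: TextEncoder.encodeInto writes UTF-8 into an existing Uint8Array and reports how much it consumed and produced. The following is only a sketch, assuming a runtime that implements encodeInto (current browsers and Node.js 12.11 or later); the buffer sizes and the offset are made-up values.

```javascript
const encoder = new TextEncoder();

// Pre-existing buffer; subarray() gives a view that starts at the chosen index.
const target = new Uint8Array(16);
const startIndex = 4; // illustrative offset
const result = encoder.encodeInto("h\u00e9llo", target.subarray(startIndex));

// read = UTF-16 code units consumed, written = bytes stored in the buffer.
console.log(result.read, result.written); // 5 6

// When the buffer is too small, only whole characters that fit are written.
const small = new Uint8Array(4);
const partial = encoder.encodeInto("\u20ac\u20ac", small); // each euro sign needs 3 bytes
console.log(partial.read, partial.written); // 1 3
```

As with the return value of GetBytes, the written count is what tells you how many bytes of the buffer belong to the string.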

Java: convert UTF8 String to byte array in another encoding

There is no such thing as a "UTF8 encoded String" in Java. Java Strings use UTF-16 internally, but should be seen as an abstraction without a specific encoding. If you have a String, it's already decoded. If you want to encode it, use string.getBytes(encoding). If your original data is UTF-8, you have to take that into account when you convert that data from bytes to String.
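
The same decode-then-encode sequence applies in other languages. As an illustration only, the following sketch uses Node.js (an assumption; Buffer is not available in browsers) to turn UTF-8 bytes into a string and then encode that string as ISO-8859-1, which Node.js calls "latin1". The sample bytes are made up for the example.

```javascript
// Assumes Node.js: Buffer is a Node.js global, not a browser API.

// Step 1: the original data is bytes, here "cafe" with an e-acute, in UTF-8.
const utf8Data = Buffer.from([0x63, 0x61, 0x66, 0xc3, 0xa9]);

// Step 2: decode the bytes into a string, naming the encoding they are in.
const decodedString = utf8Data.toString("utf8");

// Step 3: encode the string into the target encoding.
const latin1Data = Buffer.from(decodedString, "latin1");

console.log(decodedString.length);   // 4 (characters, not bytes)
console.log(Array.from(latin1Data)); // the e-acute is now the single byte 0xe9 (233)
```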

String to byte array in UTF-8?

A function like this will do what you need:

function UTF8Bytes(const s: UTF8String): TBytes;
begin
  Assert(StringElementSize(s)=1);
  SetLength(Result, Length(s));
  if Length(Result)>0 then
    Move(s[1], Result[0], Length(s));
end;

You can call it with any type of string and the RTL will convert from the encoding of the string that is passed to UTF-8. So don't be tricked into thinking you must convert to UTF-8 before calling, just pass in any string and let the RTL do the work.

After that it's a fairly standard array copy. Note the assertion that explicitly calls out the assumption on string element size for a UTF-8 encoded string.

If you want to include the zero terminator, write it like this:

function UTF8Bytes(const s: UTF8String): TBytes;
begin
  Assert(StringElementSize(s)=1);
  SetLength(Result, Length(s)+1);
  // Result is never empty here, so test the source string instead
  if Length(s)>0 then
    Move(s[1], Result[0], Length(s));
  Result[high(Result)] := 0;
end;

C# - Convert UTF8 String into bits, modify bits, and convert back into UTF8 String

If you're trying to read a file into a byte array, modify those bytes, and convert the result back to a string, you could do something like this:

// requires: using System.IO; using System.Text;
// read the file into a byte array
var bytes = File.ReadAllBytes(inputFileName);

// modify the bytes

// now decode the bytes back into a string as UTF-8
var stringFromByteArray = Encoding.UTF8.GetString(bytes, 0, bytes.Length);
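
One caveat when editing the bytes directly: a change that splits a multibyte sequence leaves invalid UTF-8, and a lenient decoder will silently substitute U+FFFD. The JavaScript sketch below (assuming the standard TextEncoder and TextDecoder classes; the sample data is made up) shows a safe edit, an unsafe one, and a strict decoder that rejects the result.

```javascript
// "cafe" with an e-acute encodes to the bytes 63 61 66 c3 a9.
const source = new TextEncoder().encode("caf\u00e9");

// Safe edit: byte 0 is a complete single-byte character.
const safe = Uint8Array.from(source);
safe[0] = 0x43; // 'c' -> 'C'
const safeText = new TextDecoder("utf-8").decode(safe);

// Unsafe edit: byte 4 is the continuation byte of the sequence c3 a9.
const unsafe = Uint8Array.from(source);
unsafe[4] = 0x41; // c3 is now followed by a non-continuation byte
const lenient = new TextDecoder("utf-8").decode(unsafe);

// A strict decoder throws instead of substituting U+FFFD.
let rejected = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(unsafe);
} catch (e) {
  rejected = true;
}

console.log(safeText === "Caf\u00e9");    // true
console.log(lenient === "caf\ufffdA");    // true
console.log(rejected);                    // true
```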

