Why Does String.getBytes() Work Differently Depending on the Operating System?

Why does String.getBytes() work differently depending on the operating system?

You are not specifying a charset when calling getBytes(), so it uses the default charset of the underlying platform (or the one given to Java at startup, for example through the file.encoding system property). This is stated in the String documentation:

public byte[] getBytes()

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

getBytes() has an overloaded version that lets you specify a charset in your code.

public byte[] getBytes(Charset charset)

Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
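To make the difference concrete, here is a minimal sketch that encodes one string with several explicit charsets. The class name CharsetDemo, the helper encode and the sample text are illustrative assumptions, not part of the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetDemo {
    // Encodes with an explicit charset, so the result does not depend on the platform default.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent; the escape keeps the source ASCII-only

        // The first two lines may differ from machine to machine.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default:    " + Arrays.toString(text.getBytes()));

        // These lines are the same on every machine.
        System.out.println("UTF-8:      " + Arrays.toString(encode(text, StandardCharsets.UTF_8)));
        System.out.println("ISO-8859-1: " + Arrays.toString(encode(text, StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-16LE:   " + Arrays.toString(encode(text, StandardCharsets.UTF_16LE)));
    }
}
```

Passing the charset explicitly on both the encoding and the decoding side is what makes the byte sequence reproducible across operating systems.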

Java String.getBytes(charset) and new String(bytes, charset) with two different character sets

According to the above, and as I understand it, the charset arguments of the two methods must be the same so that new String(bytes, charset) can return a proper string.

That is what you should aim for in order to write correct code. But this does not imply that every wrong operation will always produce wrong results. A simple example would be a string consisting of ASCII letters only. A lot of encodings produce the same byte sequence for such a string, so a test using only such a string is not sufficient to spot encoding-related errors.
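The ASCII case can be checked directly. The following sketch compares the bytes produced by three charsets that every Java platform is required to support; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiOverlap {
    // True if US-ASCII, ISO-8859-1 and UTF-8 all encode the text to the same bytes.
    static boolean sameBytesEverywhere(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, latin1) && Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // ASCII-only input: a charset mix-up would go unnoticed.
        System.out.println(sameBytesEverywhere("hello"));
        // One accented letter (U+00E9) is enough to make the encodings disagree.
        System.out.println(sameBytesEverywhere("h\u00e9llo"));
    }
}
```

A test suite that only feeds ASCII text through the encode/decode pair would pass even with mismatched charsets, which is the point made above.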

As you can see, I figured out a way of getting the original string:

[iso-8859-1,euc-kr] = 테스트  
[iso-8859-1,ksc5601] = 테스트
[iso-8859-1,x-windows-949] = 테스트

How can it be possible?
How can the string be encoded and decoded properly as different character sets?

Well, when I execute

System.out.println(Charset.forName("euc-kr") == Charset.forName("ksc5601"));

on my machine, it prints true. Or, if I execute

System.out.println(Charset.forName("euc-kr").aliases());

it prints

[ksc5601-1987, csEUCKR, ksc5601_1987, ksc5601, 5601, euc_kr, ksc_5601, ks_c_5601-1987, euckr]

So for euc-kr and ksc5601, the answer is simple. These are different names for the same character encoding.

For x-windows-949, I have to resort to Wikipedia:

Unified Hangul Code (UHC), or Extended Wansung, also known under Microsoft Windows as Code Page 949 (Windows-949, MS949 or ambiguously CP949), is the Microsoft Windows code page for the Korean language. It is an extension of Wansung Code (KS C 5601:1987, encoded as EUC-KR) to include all 11172 Hangul syllables present in Johab (KS C 5601:1992 annex 3).

So it is an extension of ksc5601, which will lead to the same result as long as you’re not using any characters affected by the extension (think of the ASCII example above).

Generally, this does not invalidate your premise. Correct results are only guaranteed when using the same encoding on both sides. It just means that testing such code is much harder, as it requires sufficient test input data to spot errors. For example, a common error in the Western world is to confuse iso-latin-1 (ISO 8859-1) with Windows codepage 1252, which may not get spotted with simple text.
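The ISO 8859-1 versus codepage 1252 confusion can be sketched as follows. This assumes the JVM provides the windows-1252 charset (Charset.forName throws otherwise); the class name and the helper misdecode are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1VsCp1252 {
    // Encodes with one charset and decodes with another, as mismatched code would.
    static String misdecode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available on this JVM.
        Charset cp1252 = Charset.forName("windows-1252");
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Plain text and common accented letters survive the mix-up unchanged.
        System.out.println(misdecode("caf\u00e9", cp1252, latin1).equals("caf\u00e9"));

        // The euro sign (U+20AC) is byte 0x80 in windows-1252, which ISO 8859-1
        // decodes as the control character U+0080, so the text is silently corrupted.
        String euro = misdecode("\u20ac", cp1252, latin1);
        System.out.println(Integer.toHexString(euro.charAt(0)));
    }
}
```

Only characters in the 0x80 to 0x9F range differ between the two encodings, so test data has to include some of them (the euro sign, curly quotes) to reveal the error.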

String.getBytes() returns different values for multiple execution?

System.out.println("file data:" + sigToVerify);

Here you are not printing the value of a String. As owlstead correctly pointed out in the comments, the Object.toString() method is invoked on the byte array sigToVerify, which leads to output of this format:

getClass().getName() + '@' + Integer.toHexString(hashCode())

If you want to print each element in the array you have to loop through it.

byte[] bytes = "i love my country".getBytes();
for (byte b : bytes) {
    System.out.println("byte = " + b);
}

Or even simpler, use the Arrays.toString() method:

System.out.println(Arrays.toString(bytes));

Byte arrays and strings in java

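It depends on your Charset.defaultCharset(), which determines how the bytes are interpreted. The negative values are bytes with the high bit set (0x80 to 0xFF). Under UTF-8, several of them do not form valid sequences in this array, so the decoder replaces each invalid sequence with U+FFFD, which encodes back as the three bytes -17, -65, -67.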

(see this great answer: https://stackoverflow.com/a/7934397/461499)

Bytes returned by getBytes() are already valid in the default charset, so decoding them and encoding them again yields the same array, and the comparison returns true:

System.out.println(Charset.defaultCharset()); // UTF-8 here :)

byte[] arr = new byte[] {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
String s = new String(arr);
System.out.println(s);

byte[] arr2 = new byte[] {56, 99, 87, 77, 73, 90, 105, -17, -65, -67, -52, -85, -17, -65, -67, -55, -115, 11, -17, -65, -67, -17, -65, -67};
System.out.println(Arrays.toString(s.getBytes()));
// [56, 99, 87, 77, 73, 90, 105, -17, -65, -67, -52, -85, -17, -65, -67, -55, -115, 11, -17, -65, -67, -17, -65, -67]
System.out.println(Arrays.equals(arr, s.getBytes())); // prints false

String s2 = new String(arr2);
System.out.println(Arrays.toString(s2.getBytes()));
System.out.println(Arrays.equals(arr2, s2.getBytes())); // prints true
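If arbitrary bytes have to survive a trip through a String, the charset must map every byte value to a character. ISO-8859-1 does that (each byte becomes the char with the same numeric value), while UTF-8 does not. The sketch below checks this on the array from the example; the class and method names are invented for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    // True if decoding the bytes and encoding the result gives back the same bytes.
    static boolean survivesRoundTrip(byte[] data, Charset charset) {
        String decoded = new String(data, charset);
        return Arrays.equals(data, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        byte[] arr = {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};

        // Invalid UTF-8 sequences are replaced by U+FFFD, so the original bytes are lost.
        System.out.println(survivesRoundTrip(arr, StandardCharsets.UTF_8));
        // ISO-8859-1 maps all 256 byte values one-to-one, so nothing is lost.
        System.out.println(survivesRoundTrip(arr, StandardCharsets.ISO_8859_1));
    }
}
```

For binary data such as signatures or keys, keeping the byte[] or using a text encoding like Base64 is usually a safer choice than decoding it as text at all.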

Get size of String w/ encoding in bytes without converting to byte[]

Simple, just write it to a dummy output stream:

// requires: import java.io.OutputStream;
class CountingOutputStream extends OutputStream {
    private int _total;

    // Count the bytes instead of storing them.
    @Override public void write(int b) {
        ++_total;
    }

    @Override public void write(byte[] b) {
        _total += b.length;
    }

    @Override public void write(byte[] b, int offset, int len) {
        _total += len;
    }

    public int getTotalSize() {
        return _total;
    }
}

CountingOutputStream cos = new CountingOutputStream();
Writer writer = new OutputStreamWriter(cos, "my_encoding");
//writer.write(myString);

// UPDATE: OutputStreamWriter does a simple copy of the _entire_ input string, to avoid that use:
for (int i = 0; i < myString.length(); i += 8096) {
    int end = Math.min(myString.length(), i + 8096);
    writer.write(myString, i, end - i);
}

writer.flush();

System.out.println("Total bytes: " + cos.getTotalSize());

It's not only simple, but probably just as fast as the other "complex" answers.
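For comparison, the same count can be obtained without any stream by driving a CharsetEncoder over a small reusable buffer. This is a different technique from the answer above, sketched under the assumption that malformed or unmappable input should be replaced, as String.getBytes does. The class name, method name and the 1024-byte buffer size are arbitrary choices.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodedLength {
    // Counts the encoded bytes while only ever holding 1024 of them in memory.
    static long encodedLength(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(1024);
        long total = 0;
        CoderResult result;
        do {
            // Encode as much as fits, count it, then reuse the buffer.
            result = encoder.encode(in, out, true);
            total += out.position();
            out.clear();
        } while (result.isOverflow());
        do {
            // Some encoders emit trailing bytes only when flushed.
            result = encoder.flush(out);
            total += out.position();
            out.clear();
        } while (result.isOverflow());
        return total;
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo \u20ac"; // contains U+00E9 and the euro sign
        System.out.println("Total bytes: " + encodedLength(text, StandardCharsets.UTF_8));
    }
}
```

Whether this is faster than the stream-based version is not something this sketch establishes; it only avoids allocating the full byte array.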

How do I get a consistent byte representation of strings in C# without manually specifying an encoding?

Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!

Like you mentioned, your goal is, simply, to "get what bytes the string has been stored in".

(And, of course, to be able to re-construct the string from the bytes.)

For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.

Just do this instead:

static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

// Do NOT use on arbitrary bytes; only use on GetBytes's output on the SAME system
static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

As long as your program (or other programs) doesn't try to interpret the bytes somehow, which you obviously didn't mention you intend to do, then there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.

Additional benefit to this approach: It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!

It will be encoded and decoded just the same, because you are just looking at the bytes.

If you used a specific encoding, though, it would've given you trouble with encoding/decoding invalid characters.

Transform a string into an array of bytes

.NET actually uses UTF-16 encoding to store strings and chars, which means each char (a UTF-16 code unit) is encoded with 2 bytes. This is detailed in Character Encoding in the .NET Framework:

UTF-16 encoding is used by the common language runtime to represent Char and String values, and it is used by the Windows operating system to represent WCHAR values.

So you should expect to get 2 bytes for every char in your string.

If you only want 1 byte per character, you have to use a different encoding. For this input, ASCII encoding will work:

public byte[] GetBytes(string str)
{
return System.Text.Encoding.ASCII.GetBytes(str);
}

Calling this with the input "TEST" will return { 84, 69, 83, 84 }.
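Java strings are also made of 16-bit UTF-16 chars, so the same effect can be reproduced there. The following Java sketch (not .NET code; the class and method names are illustrative) compares the byte counts of the two encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharWidth {
    // Number of bytes the text occupies in the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // UTF-16LE without a byte order mark: 2 bytes per char.
        System.out.println(byteCount("TEST", StandardCharsets.UTF_16LE));
        // US-ASCII: 1 byte per char for this input.
        System.out.println(Arrays.toString("TEST".getBytes(StandardCharsets.US_ASCII)));
        // A character outside the Basic Multilingual Plane (U+1F600) needs two chars, so 4 bytes.
        System.out.println(byteCount("\uD83D\uDE00", StandardCharsets.UTF_16LE));
    }
}
```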


