Why does String.getBytes() work differently depending on the operating system
You are not specifying a charset when calling getBytes(), so it uses the default charset of the underlying platform (or of Java itself if specified when Java is started). This is stated in the String documentation:
public byte[] getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
getBytes() has an overloaded version that lets you specify a charset in your code.
public byte[] getBytes(Charset charset)
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
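A minimal sketch of the overload in use. The class and method names (ExplicitCharsetDemo, utf8, latin1) are made up for illustration; only String.getBytes(Charset) and StandardCharsets come from the JDK. Passing the charset explicitly makes the result independent of the platform default:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
final class ExplicitCharsetDemo {
    // Encode with UTF-8 regardless of the platform default.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Encode with ISO-8859-1 regardless of the platform default.
    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "caf\u00E9"; // "café", written with an escape to avoid source-encoding issues
        System.out.println(Arrays.toString(utf8(s)));   // [99, 97, 102, -61, -87]
        System.out.println(Arrays.toString(latin1(s))); // [99, 97, 102, -23]
    }
}
```

The same string yields different byte sequences under the two charsets, which is the difference a changing platform default would otherwise introduce silently.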
Java String.getBytes(charset) and new String(bytes, charset) with two different character sets
According to the above, and as my understanding, the charset arguments of the two different methods must be the same so that new String(bytes, charset) can return a proper string.
That’s what you should aim at, to write correct code. But this does not imply that every wrong operation will always produce wrong results. A simple example would be a string consisting of ASCII letters only. A lot of encodings produce the same byte sequence for such a string, so a test using only such a string is not sufficient to spot encoding related errors.
As you can see, I figure out the way of getting the original string:
[iso-8859-1,euc-kr] = 테스트
[iso-8859-1,ksc5601] = 테스트
[iso-8859-1,x-windows-949] = 테스트
How can it be possible?
How can the string be encoded and decoded properly as different character sets?
Well, when I execute
System.out.println(Charset.forName("euc-kr") == Charset.forName("ksc5601"));
on my machine, it prints true. Or, if I execute
System.out.println(Charset.forName("euc-kr").aliases());
it prints
[ksc5601-1987, csEUCKR, ksc5601_1987, ksc5601, 5601, euc_kr, ksc_5601, ks_c_5601-1987, euckr]
So for euc-kr and ksc5601, the answer is simple. These are different names for the same character encoding.
For x-windows-949, I have to resort to Wikipedia:
Unified Hangul Code (UHC), or Extended Wansung, also known under Microsoft Windows as Code Page 949 (Windows-949, MS949 or ambiguously CP949), is the Microsoft Windows code page for the Korean language. It is an extension of Wansung Code (KS C 5601:1987, encoded as EUC-KR) to include all 11172 Hangul syllables present in Johab (KS C 5601:1992 annex 3).
So it is an extension of ksc5601, which will lead to the same result as long as you’re not using any characters affected by the extension (think of the ASCII example above).
Generally, this does not invalidate your premise. Correct results are only guaranteed when using the same encoding on both sides. It just means that testing code is much harder, as it requires sufficient test input data to spot errors. E.g. a common error in the Western world is to confuse iso-latin-1 (ISO 8859-1) with Windows codepage 1252, which may not get spotted with simple text.
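The ISO 8859-1 versus codepage 1252 confusion can be sketched as follows. The class and method names are hypothetical, and the example assumes the JDK in use provides the "windows-1252" charset (it is not one of the charsets every Java platform is required to support). An ASCII-only test string hides the mismatch, while a euro sign exposes it:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class MismatchedCharsetDemo {
    // Encodes with one charset and decodes with another, as buggy code might.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumption: available on this JDK
        Charset latin1 = StandardCharsets.ISO_8859_1;

        String ascii = "plain text";
        String euro = "price: 5 \u20AC"; // ends with the euro sign

        // ASCII survives the mismatched round trip, so this test input hides the bug.
        System.out.println(ascii.equals(roundTrip(ascii, cp1252, latin1))); // true
        // The euro sign is byte 0x80 in windows-1252 but decodes to U+0080 in ISO-8859-1.
        System.out.println(euro.equals(roundTrip(euro, cp1252, latin1)));   // false
    }
}
```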
String.getBytes() returns different values for multiple execution?
System.out.println("file data:" + sigToVerify);
Here you are not printing the value of a String. As owlstead pointed out correctly in the comments, the Object.toString() method will be invoked on the byte array sigToVerify, leading to an output of this format:
getClass().getName() + '@' + Integer.toHexString(hashCode())
If you want to print each element in the array you have to loop through it.
byte[] bytes = "i love my country".getBytes();
for(byte b : bytes) {
System.out.println("byte = " + b);
}
Or even simpler, use the Arrays.toString() method:
System.out.println(Arrays.toString(bytes));
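To see both behaviours side by side, here is a small sketch (the class and method names are invented for the example). It uses an explicit charset so the printed bytes do not depend on the platform:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
final class PrintBytesDemo {
    // What string concatenation effectively prints: class name, '@', hex hash code.
    static String defaultForm(byte[] b) {
        return b.toString();
    }

    // The element-by-element representation.
    static String readable(byte[] b) {
        return Arrays.toString(b);
    }

    public static void main(String[] args) {
        byte[] bytes = "abc".getBytes(StandardCharsets.US_ASCII);
        // Starts with "[B@"; the hex part after '@' varies between runs.
        System.out.println(defaultForm(bytes));
        System.out.println(readable(bytes)); // [97, 98, 99]
    }
}
```

The varying hash code in the first form is why the output appears to change from one execution to the next even though the bytes are identical.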
Byte arrays and strings in java
It depends on your Charset.defaultCharset(), which determines how the bytes are interpreted. The negative values are most likely byte sequences that are not valid in that charset, so new String(arr) replaces each malformed sequence with U+FFFD, which UTF-8 encodes as -17, -65, -67.
(see this great answer: https://stackoverflow.com/a/7934397/461499)
Re-interpreting the getBytes() result as a String then gives the canonical form, so the second comparison returns true:
System.out.println(Charset.defaultCharset()); //UTF-8 here :)
byte arr[] = new byte[] {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
String s= new String(arr);
System.out.println(s);
// [56, 99, 87, 77, 73, 90, 105, -17, -65, -67, -52, -85, -17, -65, -67, -55, -115, 11, -17, -65, -67, -17, -65, -67]
byte arr2[] = new byte[] {56, 99, 87, 77, 73, 90, 105, -17, -65, -67, -52, -85, -17, -65, -67, -55, -115, 11, -17, -65, -67, -17, -65, -67};
System.out.println(Arrays.toString(s.getBytes()));
System.out.println(Arrays.equals(arr, s.getBytes())); // returns false
String s2= new String(arr2);
System.out.println(Arrays.toString(s2.getBytes()));
System.out.println(Arrays.equals(arr2, s2.getBytes())); // returns true
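If the goal is to carry arbitrary bytes through a String unchanged, one option (not part of the answer above, offered only as a sketch) is ISO-8859-1, which maps every byte value to exactly one character. The class and method names below are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
final class LosslessBytesDemo {
    // True if decoding and re-encoding with the same charset reproduces the input bytes.
    static boolean survivesRoundTrip(byte[] data, Charset cs) {
        return Arrays.equals(data, new String(data, cs).getBytes(cs));
    }

    public static void main(String[] args) {
        // Same bytes as in the example above.
        byte[] arr = {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
        // Malformed UTF-8 sequences are replaced with U+FFFD, so the bytes change.
        System.out.println(survivesRoundTrip(arr, StandardCharsets.UTF_8));      // false
        // Every byte is a valid ISO-8859-1 character, so nothing is lost.
        System.out.println(survivesRoundTrip(arr, StandardCharsets.ISO_8859_1)); // true
    }
}
```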
Get size of String w/ encoding in bytes without converting to byte[]
Simple, just write it to a dummy output stream:
class CountingOutputStream extends OutputStream {
private int _total;
@Override public void write(int b) {
++_total;
}
@Override public void write(byte[] b) {
_total += b.length;
}
@Override public void write(byte[] b, int offset, int len) {
_total += len;
}
public int getTotalSize(){
return _total;
}
}
CountingOutputStream cos = new CountingOutputStream();
Writer writer = new OutputStreamWriter(cos, "my_encoding");
//writer.write(myString);
// UPDATE: OutputStreamWriter does a simple copy of the _entire_ input string, to avoid that use:
for(int i = 0; i < myString.length(); i+=8096) {
int end = Math.min(myString.length(), i+8096);
writer.write(myString, i, end - i);
}
writer.flush();
System.out.println("Total bytes: " + cos.getTotalSize());
It's not only simple, but probably just as fast as the other "complex" answers.
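A compact, self-contained variant of the same idea, as a sketch: the names EncodedLengthDemo, ByteCounter and encodedLength are invented here, and the chunk size of 8192 chars is an arbitrary choice. It takes a Charset object rather than an encoding name, so no UnsupportedEncodingException can occur:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class EncodedLengthDemo {
    // Discards every byte written to it and only keeps a running count.
    static final class ByteCounter extends OutputStream {
        private long total;
        @Override public void write(int b) { total++; }
        @Override public void write(byte[] b, int off, int len) { total += len; }
        long total() { return total; }
    }

    // Counts the encoded size of s without building the full byte[].
    static long encodedLength(String s, Charset charset) throws IOException {
        ByteCounter counter = new ByteCounter();
        // Closing the writer flushes any buffered bytes into the counter.
        try (Writer writer = new OutputStreamWriter(counter, charset)) {
            final int chunk = 8192; // arbitrary chunk size, in chars
            for (int i = 0; i < s.length(); i += chunk) {
                writer.write(s, i, Math.min(chunk, s.length() - i));
            }
        }
        return counter.total();
    }

    public static void main(String[] args) throws IOException {
        String korean = "\uD14C\uC2A4\uD2B8"; // the three Hangul syllables used earlier
        System.out.println(encodedLength(korean, StandardCharsets.UTF_8)); // 9
        System.out.println(encodedLength("abc", StandardCharsets.UTF_8));  // 3
    }
}
```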
How do I get a consistent byte representation of strings in C# without manually specifying an encoding?
Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!
Like you mentioned, your goal is, simply, to "get what bytes the string has been stored in".
(And, of course, to be able to re-construct the string from the bytes.)
For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.
Just do this instead:
static byte[] GetBytes(string str)
{
byte[] bytes = new byte[str.Length * sizeof(char)];
System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
return bytes;
}
// Do NOT use on arbitrary bytes; only use on GetBytes's output on the SAME system
static string GetString(byte[] bytes)
{
char[] chars = new char[bytes.Length / sizeof(char)];
System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
return new string(chars);
}
As long as your program (or other programs) don't try to interpret the bytes somehow, which you obviously didn't mention you intend to do, then there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.
Additional benefit to this approach: It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!
It will be encoded and decoded just the same, because you are just looking at the bytes.
If you used a specific encoding, though, it would've given you trouble with encoding/decoding invalid characters.
Transform a string into an array of bytes
.NET actually uses UTF-16 encoding to store strings and chars, which means each char is encoded with 2 bytes. This is detailed in Character Encoding in the .NET Framework:
UTF-16 encoding is used by the common language runtime to represent Char and String values, and it is used by the Windows operating system to represent WCHAR values.
So you should expect to get 2 bytes for every character in your string.
If you want to get only 1 byte per character, you have to use a different encoding. For this input, ASCII encoding will work:
public byte[] GetBytes(string str)
{
return System.Text.Encoding.ASCII.GetBytes(str);
}
Calling this with the input "TEST" will return { 84, 69, 83, 84 }
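For comparison with the Java answers above, a rough Java counterpart (the class and method names are hypothetical). Java strings are also sequences of UTF-16 code units, so the same 2-bytes-versus-1-byte difference shows up depending on the charset passed to getBytes:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
final class AsciiVersusUtf16Demo {
    // One byte per character for ASCII-only input.
    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Two bytes per UTF-16 code unit, little-endian, no byte order mark.
    static byte[] utf16le(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(ascii("TEST")));   // [84, 69, 83, 84]
        System.out.println(Arrays.toString(utf16le("TEST"))); // [84, 0, 69, 0, 83, 0, 84, 0]
    }
}
```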