Isn't the size of a character in Java 2 bytes?
A char represents a character in Java (*). It is 2 bytes (16 bits) in size.
That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).
When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset (**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).
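The charset-dependence described above can be made concrete. A minimal sketch that decodes the same two bytes with two explicitly named charsets from java.nio.charset.StandardCharsets:

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) {
        byte[] data = { 0x48, 0x69 }; // "Hi" in ASCII, ISO-8859-1, and UTF-8
        // A 1-byte-compatible charset decodes each byte to one character:
        String s = new String(data, StandardCharsets.ISO_8859_1);
        System.out.println(s); // Hi
        // UTF-16BE pairs the same two bytes into a single code unit:
        String t = new String(data, StandardCharsets.UTF_16BE);
        System.out.println((int) t.charAt(0)); // 0x4869 = 18537
    }
}
```

Same bytes, different strings: which result you get from new String(byte[]) without a charset argument depends entirely on the platform default.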
That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String, or between InputStream and Reader, or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, your code will be platform-dependent.
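A minimal sketch of passing the charset explicitly when wrapping a stream in a Reader (a ByteArrayInputStream stands in here for whatever stream you actually have):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    public static void main(String[] args) throws IOException {
        byte[] utf8 = "häöü".getBytes(StandardCharsets.UTF_8);
        // Name the charset on both sides of the conversion instead of
        // relying on the platform default:
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8))) {
            System.out.println("häöü".equals(br.readLine())); // true
        }
    }
}
```

Because the encoding is fixed at both ends, this round-trip succeeds on every platform regardless of the default charset.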
(*) That's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. Still, the approximation above is close enough to discuss the topic at hand.
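The footnote's distinction between code units and code points can be observed directly. As an illustrative example, the emoji U+1F600 lies outside the BMP and therefore needs two chars (a surrogate pair):

```java
public class CodeUnits {
    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, written as a surrogate pair
        System.out.println(emoji.length());                          // 2 code units (chars)
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 code point
    }
}
```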
(**) Note that on Android the default charset is always UTF-8, and starting with Java 18 the Java platform itself also switched to this default (though it can still be configured to behave the legacy way).
Java : Char vs String byte size
getBytes() outputs the String with the default encoding (most likely ISO-8859-1), while the internal char is always 2 bytes. Internally, Java always uses char arrays with 2-byte chars; if you want to know more about encoding, read the link by Oded in the question comments.
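Both claims can be checked in a few lines. A minimal sketch (Character.BYTES requires Java 8 or newer):

```java
import java.nio.charset.StandardCharsets;

public class ByteSizes {
    public static void main(String[] args) {
        String s = "a";
        // The encoded size depends on which charset getBytes() uses:
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 2
        // A char is always a 16-bit UTF-16 code unit:
        System.out.println(Character.BYTES);                                // 2
    }
}
```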
Is there a difference in SIZE between String a and a Char 'a'?
Note there are actually three cases:
String "a"
Character 'a'
char 'a'
Of these, char 'a' will take up the least amount of space (2 bytes), whereas Character 'a' and String "a" are both objects; because of the memory overhead associated with an object, both will be approximately the same size.
If 'ℤ' is in the BMP, why isn't it encoded in 2 bytes?
It seems you're mixing up two things: the character set (Unicode) and their encoding (UTF-8 or UTF-16).
0x2124 is only the 'sequence number' in the Unicode table. Unicode is nothing more than a bunch of 'sequence numbers' mapped to certain characters. Such a sequence number is called a code point, and it's often written down as a hexadecimal number.
How that number is encoded might take up more bytes than the raw code point would.
Short calculation of UTF-8 encoding of given character:
To know which bytes belong to the same character, UTF-8 uses a system where the first byte starts with a certain number (let's call it N) of 1 bits followed by a 0 bit. N is the number of bytes the character takes up. The remaining N − 1 bytes each start with the bits 10.
Hex 0x2124 = binary 10000100100100 (14 significant bits)
According to the rules above, this converts to the following UTF-8 encoding:
11100010 10000100 10100100 <-- our UTF-8 encoded result
AaaaBbDd CcDddddd CcDddddd <-- some notes, explained below
A is a set of ones followed by a zero, which denotes the number of bytes belonging to this character (three 1s = three bytes).
B is padding, because otherwise the total number of bits is not divisible by 8.
C is the continuation bits (each subsequent byte starts with 10).
D is the actual bits of our code point.
So indeed, the character ℤ takes up three bytes.
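The calculation above can be verified in Java itself. A small sketch that prints the UTF-8 bytes of ℤ in binary:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    public static void main(String[] args) {
        String z = "\u2124"; // ℤ, DOUBLE-STRUCK CAPITAL Z
        byte[] utf8 = z.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 3
        // Print each byte in binary; they match the worked example above:
        for (byte b : utf8) {
            System.out.println(Integer.toBinaryString(b & 0xFF));
        }
    }
}
```

The three printed bytes are exactly the 11100010 10000100 10100100 pattern derived by hand above.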
Is a character 1 byte or 2 bytes in Java?
(I think by "none string part" you are referring to the bytes that ObjectOutputStream emits when you create it. It is possible you don't want to use ObjectOutputStream, but I don't know your requirements.)
Just FYI, Unicode and UTF-8 are not the same thing. Unicode is a standard that specifies, amongst other things, what characters are available. UTF-8 is a character encoding that specifies how these characters shall be physically encoded in 1s and 0s. UTF-8 can use 1 byte for ASCII (<= 127) and up to 4 bytes to represent other Unicode characters.
UTF-8 is a strict superset of ASCII. So even if you specify a UTF-8 encoding for a file and you write "abcd" to it, it will contain just those four bytes: they have the same physical encoding in ASCII as they do in UTF-8.
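This can be checked with a short sketch comparing the UTF-8 and ASCII encodings of the same string:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSubset {
    public static void main(String[] args) {
        byte[] utf8 = "abcd".getBytes(StandardCharsets.UTF_8);
        byte[] ascii = "abcd".getBytes(StandardCharsets.US_ASCII);
        System.out.println(utf8.length);              // 4: one byte per character
        System.out.println(Arrays.equals(utf8, ascii)); // true: identical bytes
    }
}
```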
Your method uses ObjectOutputStream, which actually has a significantly different encoding than either ASCII or UTF-8! If you read the Javadoc carefully: if obj is a string that has already occurred in the stream, subsequent calls to writeObject will cause a reference to the previous string to be emitted, potentially causing many fewer bytes to be written in the case of repeated strings.
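The back-reference behaviour can be checked empirically. The exact serialized sizes are implementation details, so this sketch only asserts that writing the same string twice costs much less than two full copies:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class OosDemo {
    public static void main(String[] args) throws IOException {
        String s = "hello";
        // Serialize the string once...
        ByteArrayOutputStream once = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(once)) {
            oos.writeObject(s);
        }
        // ...and twice into a fresh stream; the second write emits only
        // a back-reference to the first occurrence:
        ByteArrayOutputStream twice = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(twice)) {
            oos.writeObject(s);
            oos.writeObject(s);
        }
        // Two writes take far fewer bytes than two independent copies would:
        System.out.println(twice.size() < once.size() * 2); // true
    }
}
```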
If you're serious about understanding this, you really should spend a good amount of time reading about Unicode and character encoding systems. Wikipedia has an excellent article on Unicode as a start.