Isn't the Size of a Character in Java 2 Bytes?

Isn't the size of a character in Java 2 bytes?

A char represents a character in Java (*). It is 2 bytes (16 bits) in size.

That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact, many character encodings use only 1 byte per character (or use 1 byte for the most common characters).

When you call the String(byte[]) constructor, you ask Java to convert the byte[] to a String using the platform's default charset (**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1, or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
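As a minimal sketch of the difference (the class name is made up; the single byte 0x41 happens to be 'A' in ASCII, ISO-8859-1 and UTF-8):

    import java.nio.charset.StandardCharsets;

    public class DefaultCharsetDemo {
        public static void main(String[] args) {
            byte[] bytes = { 0x41 };  // 0x41 is 'A' in ASCII, ISO-8859-1 and UTF-8

            // Decoded with the platform default charset: the result depends on the JVM configuration
            String platformDependent = new String(bytes);

            // Decoded with an explicit charset: the result is the same on every platform
            String explicit = new String(bytes, StandardCharsets.ISO_8859_1);

            System.out.println(platformDependent + " / " + explicit);  // usually prints "A / A"
        }
    }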

If you run that no-charset conversion on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode replacement character instead).

That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String, or between InputStream and Reader, or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, your code will be platform-dependent.
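A small sketch of that advice applied to streams (the class name and the file name example.txt are made up for this example):

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class ExplicitCharsetIo {
        public static void main(String[] args) throws IOException {
            File file = new File("example.txt");  // hypothetical file name

            // OutputStream -> Writer with an explicit charset instead of the platform default
            try (Writer out = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
                out.write("h\u00e9llo");  // "héllo"
            }

            // InputStream -> Reader with the matching explicit charset
            try (Reader in = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
                int c;
                while ((c = in.read()) != -1) {
                    System.out.print((char) c);
                }
            }
        }
    }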

(*) That's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point, and a Unicode code point usually represents a character, but sometimes multiple code points are used to make up a single character. Still, the approximation above is close enough to discuss the topic at hand.

(**) Note that on Android the default charset is always UTF-8, and starting with Java 18 the Java platform itself also switched to this default (though it can still be configured to behave the legacy way).

Java: Char vs String Byte Size

getBytes() encodes the String with the default encoding (most likely ISO-8859-1), while the internal char is always 2 bytes. Internally, Java always uses char arrays with 2-byte chars; if you want to know more about encodings, read the link posted by Oded in the question comments.
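A rough illustration (class name made up; \u00e9 is 'é', written as an escape so the snippet does not depend on the source file encoding):

    import java.nio.charset.StandardCharsets;

    public class CharVsEncodedBytes {
        public static void main(String[] args) {
            String s = "\u00e9";  // the single character 'é'

            // In memory, the String consists of 16-bit char values (UTF-16 code units)
            System.out.println(s.length());                                      // 1

            // The encoded byte length depends on the charset passed to getBytes()
            System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length);  // 1
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);       // 2
        }
    }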

Is there a difference in SIZE between String "a" and char 'a'?

Note that there are actually 3 cases:

String "a"

Character 'a'

char 'a'

Of these, char 'a' will take up the least space (2 bytes), whereas Character 'a' and String "a" are both objects; because of the memory overhead associated with an object, both will be approximately the same size.
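A small sketch of the three declarations; the sizes in the comments are approximate, since object header size and padding depend on the JVM:

    public class SizeOfA {
        public static void main(String[] args) {
            char primitive = 'a';    // 2 bytes: one UTF-16 code unit
            Character boxed = 'a';   // an object: header + a 2-byte char field (+ padding)
            String string = "a";     // an object: header + fields + a backing array holding the data

            System.out.println(primitive + " " + boxed + " " + string);  // a a a
        }
    }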

If 'ℤ' is in the BMP, why isn't it encoded in 2 bytes?

It seems you're mixing up two things: the character set (Unicode) and its encodings (UTF-8 or UTF-16).

0x2124 is only the 'sequence number' in the Unicode table. Unicode is nothing more than a bunch of such 'sequence numbers' mapped to characters. Such a sequence number is called a code point, and it's usually written as a hexadecimal number (here U+2124).

How that number is actually encoded may take up more bytes than the raw code point value itself.


A short calculation of the UTF-8 encoding of the given character:
To know which bytes belong to the same character, UTF-8 uses a scheme where the first byte starts with a certain number (let's call it N) of 1 bits followed by a 0 bit. N is the number of bytes the character takes up. Each of the remaining N − 1 bytes starts with the bits 10.

Hex 0x2124 = binary 10 0001 0010 0100 (14 significant bits)

Since 14 bits do not fit in a two-byte UTF-8 sequence (which only has room for 11 payload bits), three bytes are needed. According to the rules above, the code point converts to the following UTF-8 encoding:

11100010 10000100 10100100    <-- Our UTF-8 encoded result
AaaaBbDd CcDddddd CcDddddd    <-- Some notes, explained below
  • A is a set of ones followed by a zero, which denotes the number of bytes belonging to this character (three 1s = three bytes).
  • B is padding, because otherwise the total number of bits is not divisible by 8.
  • C marks the continuation bits (each subsequent byte starts with 10).
  • D is the actual bits of our code point.

So indeed, the character ℤ takes up three bytes.
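You can check this from Java itself (the class name is made up; the \u2124 escape keeps the snippet independent of the source file encoding):

    import java.nio.charset.StandardCharsets;

    public class CodePointBytes {
        public static void main(String[] args) {
            String z = "\u2124";  // ℤ (DOUBLE-STRUCK CAPITAL Z)

            byte[] utf8 = z.getBytes(StandardCharsets.UTF_8);
            System.out.println(utf8.length);           // 3
            for (byte b : utf8) {
                System.out.printf("%02X ", b & 0xFF);  // E2 84 A4
            }
        }
    }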

Is a character 1 byte or 2 bytes in Java?

(I think by "none string part" you are referring to the bytes that ObjectOutputStream emits when you create it. It is possible you don't want to use ObjectOutputStream, but I don't know your requirements.)

Just FYI, Unicode and UTF-8 are not the same thing. Unicode is a standard that specifies, amongst other things, which characters are available. UTF-8 is a character encoding that specifies how those characters are physically encoded in 1s and 0s. UTF-8 uses 1 byte for ASCII characters (code points <= 127) and up to 4 bytes to represent other Unicode characters.

UTF-8 is a strict superset of ASCII. So even if you specify a UTF-8 encoding for a file and you write "abcd" to it, it will contain just those four bytes: they have the same physical encoding in ASCII as they do in UTF-8.
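A quick way to see this from code (class name made up for the example):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class AsciiSubsetOfUtf8 {
        public static void main(String[] args) {
            byte[] ascii = "abcd".getBytes(StandardCharsets.US_ASCII);
            byte[] utf8  = "abcd".getBytes(StandardCharsets.UTF_8);

            System.out.println(ascii.length);               // 4
            System.out.println(utf8.length);                // 4
            System.out.println(Arrays.equals(ascii, utf8)); // true -- identical bytes
        }
    }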

Your method uses ObjectOutputStream, which actually has a significantly different encoding than either ASCII or UTF-8! If you read the Javadoc carefully: if obj is a string that has already occurred in the stream, subsequent calls to writeObject emit a reference to the previous string, potentially causing many fewer bytes to be written for repeated strings.
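A sketch that makes the back-reference visible; the class name is made up, and the exact byte counts are implementation details of Java serialization, so treat them as approximate:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;

    public class WriteObjectSharing {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                String s = "hello world";

                out.writeObject(s);   // full string data is written
                out.flush();
                int afterFirst = bytes.size();

                out.writeObject(s);   // same object again: only a back-reference is written
                out.flush();
                int afterSecond = bytes.size();

                System.out.println("first write:  " + afterFirst + " bytes");
                System.out.println("second write: " + (afterSecond - afterFirst) + " more bytes");
            }
        }
    }

Running it typically shows the second write adding only a handful of bytes (the back-reference handle) compared to the first.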

If you're serious about understanding this, you really should spend a good amount of time reading about Unicode and character encoding systems. Wikipedia has an excellent article on Unicode as a start.


