Bytes of a String in Java

How to count String bytes properly?

The word endereço should return me length 9 instead of 8.

If you expect a size of 9 bytes for the "endereço" String, which has a length of 8 characters (7 ASCII characters and 1 non-ASCII character), I suppose you want to use the UTF-8 charset, which uses 1 byte for characters in the ASCII range and more for the others.

but neither the String length() method nor the length of the byte array
returned from getBytes() counts special chars as two bytes.


The String length() method doesn't answer the question "how many bytes are used?" but rather "how many UTF-16 code units (more simply, chars) does the string contain?"

String length() Javadoc:

Returns the length of this string. The length is equal to the number
of Unicode code units in the string.
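
For example, a quick check using the string from the question:

String s = "endereço";
System.out.println(s.length());             // 8 chars (UTF-16 code units), regardless of any charset
System.out.println(s.toCharArray().length); // also 8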


The byte[] getBytes() method with no argument encodes the String into a byte array. You could use the length of the returned array to know how many bytes the encoded String uses, but the result depends on the charset used during the encoding.
But the byte[] getBytes() method doesn't allow you to specify the charset: it uses the platform's default charset.

So, using it may not give the expected result if the underlying OS's default charset is not the one you want to use to encode your Strings into bytes.

Besides, the way Strings are encoded into bytes may change depending on the platform where the application is deployed, which may be undesirable.

Finally, if the String cannot be encoded in the default charset, the behavior is unspecified.

So, this method should be used with great caution, or not used at all.

byte[] getBytes() Javadoc:

Encodes this String into a sequence of bytes using the platform's
default charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the
default charset is unspecified. The java.nio.charset.CharsetEncoder
class should be used when more control over the encoding process is
required.

In your example, if getBytes() returns an array of size 8 and not 9 for the String "endereço", it means that your OS doesn't use UTF-8 by default but a fixed-width charset using 1 byte per character, such as ISO 8859-1 or one of its derivatives such as windows-1252 on Windows.

To know the default charset of the current Java virtual machine where the application runs, you can use this utility method: Charset defaultCharset = Charset.defaultCharset().


Solution

The byte[] getBytes() method comes with two other very useful overloads:

  • byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException

  • byte[] java.lang.String.getBytes(Charset charset)

Contrary to the getBytes() method with no argument, these methods allow you to specify the charset to use during the byte encoding.

byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException Javadoc:

Encodes this String into a sequence of bytes using the named charset,
storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the
given charset is unspecified. The java.nio.charset.CharsetEncoder
class should be used when more control over the encoding process is
required.

byte[] java.lang.String.getBytes(Charset charset) Javadoc:

Encodes this String into a sequence of bytes using the given charset,
storing the result into a new byte array.

This method always replaces malformed-input and unmappable-character
sequences with this charset's default replacement byte array. The
java.nio.charset.CharsetEncoder class should be used when more control
over the encoding process is required.

You may use one or the other (though there are some subtle differences between them) to encode your String into a byte array with UTF-8 or any other charset, and so get its size for that specific charset.

For example, to get a UTF-8-encoded byte array using getBytes(String charsetName), you can do this:

String yourString = "endereço";
byte[] bytes = yourString.getBytes("UTF-8");
int sizeInBytes = bytes.length;

And you will get a length of 9 bytes, as expected.

Here is a more comprehensive example that displays the default charset and the byte lengths obtained with the platform's default charset, UTF-8, and UTF-16:

public static void main(String[] args) throws UnsupportedEncodingException {

    // default charset
    Charset defaultCharset = Charset.defaultCharset();
    System.out.println("default charset = " + defaultCharset);

    // String sample
    String yourString = "endereço";

    // getBytes() with default platform encoding
    System.out.println("getBytes() with default charset, size = " + yourString.getBytes().length + System.lineSeparator());

    // getBytes() with specific charset UTF-8
    System.out.println("getBytes(\"UTF-8\"), size = " + yourString.getBytes("UTF-8").length);
    System.out.println("getBytes(StandardCharsets.UTF_8), size = " + yourString.getBytes(StandardCharsets.UTF_8).length + System.lineSeparator());

    // getBytes() with specific charset UTF-16
    System.out.println("getBytes(\"UTF-16\"), size = " + yourString.getBytes("UTF-16").length);
    System.out.println("getBytes(StandardCharsets.UTF_16), size = " + yourString.getBytes(StandardCharsets.UTF_16).length);
}

Output on my machine, which is Windows-based:

default charset = windows-1252
getBytes() with default charset, size = 8

getBytes("UTF-8"), size = 9
getBytes(StandardCharsets.UTF_8), size = 9

getBytes("UTF-16"), size = 18
getBytes(StandardCharsets.UTF_16), size = 18

Difference between String.length() and String.getBytes().length

String.length()

String.length() is the number of 16-bit UTF-16 code units needed to represent the string. That is, it is the number of char values used to represent the string, and thus also equal to toCharArray().length. For most characters used in western languages this is typically the same as the number of Unicode characters (code points) in the string, but the number of code points will be less than the number of code units if any UTF-16 surrogate pairs are used. Such pairs are needed only to encode characters outside the BMP and are rarely used in most writing (emoji are a common exception).
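
A quick illustration (the "\uD83D\uDE00" escape pair is just an example of a character outside the BMP, U+1F600):

String s = "e\uD83D\uDE00"; // 'e' plus one character outside the BMP, encoded as a surrogate pair
System.out.println(s.length());                      // 3 UTF-16 code units (chars)
System.out.println(s.codePointCount(0, s.length())); // 2 Unicode code points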

String.getBytes().length

String.getBytes().length on the other hand is the number of bytes needed to represent your string in the platform's default encoding. For example, if the default encoding was UTF-16 (rare), it would be exactly 2x the value returned by String.length() (since each 16-bit code unit takes 2 bytes to represent). More commonly, your platform encoding will be a multi-byte encoding like UTF-8.

This means the relationship between those two lengths is more complex. For ASCII strings, the two calls will almost always produce the same result (outside of unusual default encodings that don't encode the ASCII subset in 1 byte). Outside of ASCII strings, String.getBytes().length is likely to be longer, as it counts bytes needed to represent the string, while length() counts 2-byte code units.

Which is more suitable?

Usually you'll use String.length() in concert with other string methods that take offsets into the string. E.g., to get the last character, you'd use str.charAt(str.length()-1). You'd only use the getBytes().length if for some reason you were dealing with the array-of-bytes encoding returned by getBytes.

How to convert Java String into byte[]?

The object your method decompressGZIP() needs is a byte[].

So the basic, technical answer to the question you have asked is:

byte[] b = string.getBytes();
byte[] b = string.getBytes(Charset.forName("UTF-8"));
byte[] b = string.getBytes(StandardCharsets.UTF_8); // Java 7+ only

However, the problem you appear to be wrestling with is that a byte[] doesn't display very well. Calling toString() will just give you the default Object.toString(), which is the class name + '@' + the identity hash code in hex. In your result [B@38ee9f13, the [B means byte[] and 38ee9f13 is that hash code, separated by an @.

For display purposes you can use:

Arrays.toString(bytes);

But this will just display as a sequence of comma-separated integers, which may or may not be what you want.
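
For example:

System.out.println(Arrays.toString("abc".getBytes(StandardCharsets.UTF_8))); // prints [97, 98, 99]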

To get a readable String back from a byte[], use the String(byte[] bytes, Charset charset) constructor:

String string = new String(bytes, StandardCharsets.UTF_8);

The reason the Charset version is favoured is that all String objects in Java are stored internally as UTF-16. When converting to a byte[] you will get a different breakdown of bytes for the given glyphs of that String, depending upon the chosen charset.
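
For example, a minimal round trip (assuming UTF-8 on both sides):

String original = "endereço";
byte[] bytes = original.getBytes(StandardCharsets.UTF_8);   // encode: String -> byte[] (9 bytes)
String decoded = new String(bytes, StandardCharsets.UTF_8); // decode: byte[] -> String
System.out.println(decoded.equals(original)); // true, because the same charset is used both ways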

How many bytes does a string contain?

Depends. What encoding do you want to use?

System.out.println("äö".getBytes("UTF-8").length);

Prints 4, but if I change UTF-8 to ISO-8859-1 (for example), it'll print 2. Other encodings may print other values (try UTF-32).
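
A small sketch that prints the encoded size for a few charsets (note that Java's "UTF-16" charset writes a byte order mark, so its result includes 2 extra bytes; "UTF-32" availability depends on the JRE):

String s = "äö";
for (String charsetName : new String[] {"UTF-8", "ISO-8859-1", "UTF-16", "UTF-32"}) {
    // Charset.forName avoids the checked UnsupportedEncodingException of getBytes(String)
    System.out.println(charsetName + ": " + s.getBytes(Charset.forName(charsetName)).length + " bytes");
}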

Is there any way to get the size in bytes of a string in Java?

You probably use something like the following to read the file:

FileInputStream fis = new FileInputStream(path);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line;
while ((line = br.readLine()) != null) {
    /* process line */
    /* report percentage */
}

You need to specify the encoding right at the beginning. If you don't, you should get UTF-8 on Android: it is the default, but that can be changed. I would assume that no device actually does that, though.

To repeat what the other answers already stated: the character count is not always the same as the byte count. The UTF encodings in particular are tricky. There are currently 249,764 assigned Unicode characters and potentially over a million (WP), and UTF uses 1 to 4 bytes to be able to encode all of them. UTF-32 is the simplest case since it always uses 4 bytes. UTF-8 does this dynamically and uses 1 to 4 bytes; simple ASCII characters use just 1 byte. (source: UTF & BOM FAQ)

To get the number of bytes you can use e.g. line.getBytes("UTF-8").length. One big disadvantage is that this is very inefficient, since it creates a copy of the String's internal array each time and throws it away afterwards. That is #1 addressed in Android | Performance Tips.

It is also not 100% accurate in terms of actual bytes read from the file for following reasons:

  • UTF-16 text files for example often start with a special 2-byte BOM (Byte Order Mark) to signal whether they have to be interpreted little- or big-endian. Those 2 (UTF-8: 3, UTF-32: 4) bytes are not reported when you just look at the String you get from your reader. So you are already some bytes off here.

  • Turning every line of the file into a UTF-16 byte array will include those BOM bytes for each line. So getBytes will report 2 bytes too many for each line.

  • Line-ending characters are not part of the resulting line String. To make things worse, there are different ways of signaling the end of a line: usually the Unix-style '\n', which is only 1 character, or the Windows-style "\r\n", which is two characters. The BufferedReader simply skips those. Here your calculation is missing a very variable amount of bytes: from 1 byte for Unix/UTF-8 to 8 bytes for Windows/UTF-32.

The last two reasons would cancel each other out if you have Unix/UTF-16, but that is probably not the typical case. The effect of the error also depends on line length: if you have an error of 4 bytes for each line that is only 10 bytes long in total, your progress will be quite considerably wrong (if my math is good, your progress would be at 140% or 60% after the last line, depending on whether your calculation assumes -4 or +4 bytes per line).

That means that so far, regardless of what you do, you get no more than an approximation.

Getting the actual byte count could probably be done if you write your own special byte-counting Reader, but that would be quite a lot of work.

An alternative would be to use a custom InputStream that counts how many bytes are actually read from the underlying stream. That's not too hard to do, and it does not care about encodings.

The big disadvantage is that it does not increase linearly with the lines you read, since BufferedReader fills its internal buffer and reads lines from there, then reads the next chunk from the file, and so on. If the buffer is large enough you are already at 100% at the first line. But I assume your files are big enough, or you would not want to find out about the progress.

This, for example, would be such an implementation. It works, but I can't guarantee that it is perfect. It won't work if streams use mark() and reset(); file reading should not do that, though.

static class CountingInputStream extends FilterInputStream {
    private long bytesRead;

    protected CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int result = super.read();
        if (result != -1) bytesRead += 1;
        return result;
    }

    @Override
    public int read(byte[] b) throws IOException {
        int result = super.read(b);
        if (result != -1) bytesRead += result;
        return result;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int result = super.read(b, off, len);
        if (result != -1) bytesRead += result;
        return result;
    }

    @Override
    public long skip(long n) throws IOException {
        long result = super.skip(n);
        if (result != -1) bytesRead += result;
        return result;
    }

    public long getBytesRead() {
        return bytesRead;
    }
}

Using the following code

File file = new File("mytestfile.txt");
int linesRead = 0;
long progress = 0;
long fileLength = file.length();
String line;

CountingInputStream cis = new CountingInputStream(new FileInputStream(file));
BufferedReader br = new BufferedReader(new InputStreamReader(cis, "UTF-8"), 8192);
while ((line = br.readLine()) != null) {
    long newProgress = cis.getBytesRead();
    if (progress != newProgress) {
        progress = newProgress;
        int percent = (int) ((progress * 100) / fileLength);
        System.out.println(String.format("At line: %4d, bytes: %6d = %3d%%", linesRead, progress, percent));
    }
    linesRead++;
}
System.out.println("Total lines: " + linesRead);
System.out.println("Total bytes: " + fileLength);
br.close();

I get output like

At line:    0, bytes:   8192 =   5%
At line:   82, bytes:  16384 =  10%
At line:  178, bytes:  24576 =  15%
....
At line: 1621, bytes: 155648 =  97%
At line: 1687, bytes: 159805 = 100%
Total lines: 1756
Total bytes: 159805

or in case of the same file UTF-16 encoded

At line:    0, bytes:  24576 =   7%
At line:   82, bytes:  40960 =  12%
At line:  178, bytes:  57344 =  17%
.....
At line: 1529, bytes: 303104 =  94%
At line: 1621, bytes: 319488 =  99%
At line: 1687, bytes: 319612 = 100%
Total lines: 1756
Total bytes: 319612

Instead of printing that you could update your progress.

So, what is the best approach?

  • If you know that you have simple ASCII text in an encoding that uses only 1 byte for those characters: just use String#length() (and maybe add +1 or +2 for the line ending).
    String#length() is fast and simple, and as long as you know what files you have, you should have no problems.
  • If you have international text where the simple approach won't work:
    • for smaller files where processing each line takes rather long: String#getBytes(); the longer processing 1 line takes, the lower the impact of temporary arrays and their garbage collection. The inaccuracy should be within acceptable bounds. Just make sure not to crash if progress ends up above or below 100% at the end (see the sketch after this list).
    • for larger files: the approach above with the counting stream. The larger the file the better. Updating progress in 0.001% steps just slows things down. Decreasing the reader's buffer size would increase the accuracy, but it also decreases the read performance.
  • If you have enough time: write your own Reader that tells you the exact byte position. Maybe a combination of InputStreamReader and BufferedReader, since Reader already operates on characters. Android's implementation may help as a starting point.
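
Here is a minimal sketch of that per-line approximation, assuming a UTF-8 file with Unix '\n' line endings and no BOM, the usual java.io/java.nio.charset imports, and a surrounding method that declares throws IOException (the file name is just the sample from above):

File file = new File("mytestfile.txt");
long fileLength = file.length();
long approxBytes = 0;
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
    String line;
    while ((line = br.readLine()) != null) {
        // bytes of the line itself, plus 1 byte for the '\n' that readLine() stripped
        approxBytes += line.getBytes(StandardCharsets.UTF_8).length + 1;
        int percent = (int) Math.min(100, (approxBytes * 100) / fileLength);
        // update your progress indicator with percent here
    }
}

Because of the BOM and line-ending issues listed above, the percentage is only approximate, which is why it is clamped to 100 here.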

Read String and bytes from the same file in Java

Read as bytes. When you have read a sequence of bytes that you know should be a string, place those bytes in an array, put the array inside a ByteArrayInputStream and use that as the underlying InputStream for a Reader to get the bytes as characters, then read those characters to produce a String.

For the later parts of this process see the related SO question on how to create a String from an InputStream.
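
A minimal sketch of that chain (the UTF-8 charset and the sample bytes are assumptions for illustration; in practice the bytes come from your file, and the surrounding method declares throws IOException):

// bytes that are known to contain text in a known charset
byte[] stringBytes = "endereço".getBytes(StandardCharsets.UTF_8);

StringBuilder sb = new StringBuilder();
try (Reader reader = new InputStreamReader(new ByteArrayInputStream(stringBytes), StandardCharsets.UTF_8)) {
    int c;
    while ((c = reader.read()) != -1) {
        sb.append((char) c); // read() returns one UTF-16 code unit at a time
    }
}
String text = sb.toString(); // "endereço"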

Get size of String w/ encoding in bytes without converting to byte[]

Simple, just write it to a dummy output stream:

class CountingOutputStream extends OutputStream {
    private int _total;

    @Override
    public void write(int b) {
        ++_total;
    }

    @Override
    public void write(byte[] b) {
        _total += b.length;
    }

    @Override
    public void write(byte[] b, int offset, int len) {
        _total += len;
    }

    public int getTotalSize() {
        return _total;
    }
}

CountingOutputStream cos = new CountingOutputStream();
Writer writer = new OutputStreamWriter(cos, "my_encoding");
//writer.write(myString);

// UPDATE: OutputStreamWriter does a simple copy of the _entire_ input string; to avoid that, write in chunks:
for (int i = 0; i < myString.length(); i += 8096) {
    int end = Math.min(myString.length(), i + 8096);
    writer.write(myString, i, end - i);
}

writer.flush();

System.out.println("Total bytes: " + cos.getTotalSize());

It's not only simple, but probably just as fast as the other "complex" answers.

Splitting a string with byte length limits in Java

The CharsetEncoder class has provision for your requirement. Extract from the Javadoc of the encode method:

public final CoderResult encode(CharBuffer in,
                                ByteBuffer out,
                                boolean endOfInput)

Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer...

In addition to reading characters from the input buffer and writing bytes to the output buffer, this method returns a CoderResult object to describe its reason for termination:

...

CoderResult.OVERFLOW indicates that there is insufficient space in the output buffer to encode any more characters. This method should be invoked again with an output buffer that has more remaining bytes. This is typically done by draining any encoded bytes from the output buffer.

A possible implementation could be:

public static String[] SplitStringByByteLength(String src, String encoding, int maxsize) {
    Charset cs = Charset.forName(encoding);
    CharsetEncoder coder = cs.newEncoder();
    ByteBuffer out = ByteBuffer.allocate(maxsize); // output buffer of required size
    CharBuffer in = CharBuffer.wrap(src);
    List<String> ss = new ArrayList<>(); // a list to store the chunks
    int pos = 0;
    while (true) {
        CoderResult cr = coder.encode(in, out, true); // try to encode as much as possible
        int newpos = src.length() - in.length();
        String s = src.substring(pos, newpos);
        ss.add(s); // add what has been encoded to the list
        pos = newpos; // store new input position
        out.rewind(); // and rewind the output buffer
        if (!cr.isOverflow()) {
            break; // everything has been encoded
        }
    }
    return ss.toArray(new String[0]);
}

This will split the original string into chunks that, when encoded in bytes, fit as closely as possible into byte arrays of the given size (assuming, of course, that maxsize is not ridiculously small).
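
For example, a possible call (the 5-byte limit is arbitrary, chosen only to force a split):

String[] chunks = SplitStringByByteLength("endereço", "UTF-8", 5);
for (String chunk : chunks) {
    System.out.println(chunk + " -> " + chunk.getBytes(StandardCharsets.UTF_8).length + " bytes");
}
// each chunk encodes to at most 5 bytes in UTF-8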


