How to Know the Size of the String in Bytes

How to know the size of the string in bytes?

You can use encoding like ASCII to get a character per byte by using the System.Text.Encoding class.

or try this

  System.Text.ASCIIEncoding.Unicode.GetByteCount(string);
System.Text.ASCIIEncoding.ASCII.GetByteCount(string);

Python : Get size of string in bytes

If you want the number of bytes in a string, this function should do it for you pretty solidly.

def utf8len(s):
return len(s.encode('utf-8'))

The reason you got weird numbers is because encapsulated in a string is a bunch of other information due to the fact that strings are actual objects in python.

Its interesting because if you look at my solution to encode the string into 'utf-8', there's an 'encode' method on the 's' object (which is a string). Well, it needs to be stored somewhere right? Hence, the higher than normal byte count. Its including that method, along with a few others :).

How to get the string size in bytes?

You can use strlen. Size is determined by the terminating null-character, so passed string should be valid.

If you want to get size of memory buffer, that contains your string, and you have pointer to it:

  • If it is dynamic array(created with malloc), it is impossible to get
    it size, since compiler doesn't know what pointer is pointing at.
    (check this)
  • If it is static array, you can use sizeof to get its size.

If you are confused about difference between dynamic and static arrays, check this.

How to count String bytes properly?

The word endereço should return me length 9 instead of 8.

If you expect to have a size of 9 bytes for the "endereço" String that has a length of 8 characters : 7 ASCII characters and 1 not ASCII character, I suppose that you want to use UTF-8 charset that uses 1 byte for characters included in the ASCII table and more for the others.

but String length method or getting the length of it with the byte
array returned from getBytes method doesn't return special chars
counted as two bytes.


String length() method doesn't answer to the question : how many bytes are used ? But answer to : "how many "UTF-16 code units" or more simply chars are contained in?"

String length() Javadoc :

Returns the length of this string. The length is equal to the number
of Unicode code units in the string.


The byte[] getBytes() method with no argument encodes the String into a byte array. You could use the length property of the returned array to know how many bytes are used by the encoded String but the result will depend on the charset used during the encoding.
But the byte[] getBytes() method doesn't allow to specify the charset : it uses the platform's default charset.

So, using it may not give the expected result if the underlying OS uses by default a charset that is not which one that you want to use to encode your Strings in bytes.

Besides, according to the platform where the application is deployed, the way which the String are encoded in bytes may change. Which may be undesirable.

At last, if the String cannot be encoded in the default charset, the behavior is unspecified.

So, this method should be used with very caution or not used at all.

byte[] getBytes() Javadoc :

Encodes this String into a sequence of bytes using the platform's
default charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the
default charset is unspecified. The java.nio.charset.CharsetEncoder
class should be used when more control over the encoding process is
required.

In your String example "endereço", if getBytes() returns a array with a size of 8 and not 9, it means that your OS doesn't use by default UTF-8 but a charset using 1 byte fixed width by character such as ISO 8859-1 and its derived charsets such as windows-1252 for Windows OS based.

To know the default charset of the current Java virtual machine where the application runs, you can use this utility method : Charset defaultCharset = Charset.defaultCharset().


Solution

byte[] getBytes() method comes with two other very useful overloads :

  • byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException

  • byte[] java.lang.String.getBytes(Charset charset)

Contrary to the getBytes() method with no argument, these methods allow to specify the charset to use during the byte encoding.

byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException Javadoc :

Encodes this String into a sequence of bytes using the named charset,
storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the
given charset is unspecified. The java.nio.charset.CharsetEncoder
class should be used when more control over the encoding process is
required.

byte[] java.lang.String.getBytes(Charset charset) Javadoc :

Encodes this String into a sequence of bytes using the given charset,
storing the result into a new byte array.

This method always replaces malformed-input and unmappable-character
sequences with this charset's default replacement byte array. The
java.nio.charset.CharsetEncoder class should be used when more control
over the encoding process is required.

You may use one or the other one (while there are some intricacies between them) to encode your String in a byte array with UTF-8 or any other charset and so get its size for this specific charset .

For example to get an UTF-8 encoding byte array by using getBytes(String charsetName) you can do that :

String yourString = "endereço";
byte[] bytes = yourString.getBytes("UTF-8");
int sizeInBytes = bytes.length;

And you will get a length of 9 bytes as you wish.

Here is a more comprehensive example with default encoding displayed, byte encoding with default charset platform, UTF-8 and UTF-16 :

public static void main(String[] args) throws UnsupportedEncodingException {

// default charset
Charset defaultCharset = Charset.defaultCharset();
System.out.println("default charset = " + defaultCharset);

// String sample
String yourString = "endereço";

// getBytes() with default platform encoding
System.out.println("getBytes() with default charset, size = " + yourString.getBytes().length + System.lineSeparator());

// getBytes() with specific charset UTF-8
System.out.println("getBytes(\"UTF-8\"), size = " + yourString.getBytes("UTF-8").length);
System.out.println("getBytes(StandardCharsets.UTF_8), size = " + yourString.getBytes(StandardCharsets.UTF_8).length + System.lineSeparator());

// getBytes() with specific charset UTF-16
System.out.println("getBytes(\"UTF-16\"), size = " + yourString.getBytes("UTF-16").length);
System.out.println("getBytes(StandardCharsets.UTF_16), size = " + yourString.getBytes(StandardCharsets.UTF_16).length);
}

Output on my machine that is Windows OS based:

default charset = windows-1252

getBytes() with default charset, size = 8

getBytes("UTF-8"), size = 9

getBytes(StandardCharsets.UTF_8), size = 9

getBytes("UTF-16"), size = 18

getBytes(StandardCharsets.UTF_16), size = 18

Get the size of a string

Refer to this link: How to know the size of the string in bytes?

System.Text.Encoding.Unicode.GetByteCount(s);
System.Text.Encoding.ASCII.GetByteCount(s);

or from msdn: http://msdn.microsoft.com/en-us/library/system.string.aspx

How to get string's byte length?

From the MSDN:

The .NET Framework uses the UTF-16 encoding (represented by the UnicodeEncoding class) to represent characters and string

So a1.Length is in UTF-16 code units (What's the difference between a character, a code point, a glyph and a grapheme?). Cyrillic characters, being in the base BMP (Base Multilingual Plane), all use a single code unit (so a single char). Many emoji for example use TWO code units (two char, 4 bytes!)... They aren't in the BMP. See for example https://ideone.com/ASDORp.

If you want the size IN BYTES, a1.Length * 2 clearly is the length :-) If you want to know in UTF8 (a very common encoding, NOT USED INTERNALLY BY .NET, but very used by the web, xml, ...) how many bytes it would be Encoding.UTF8.GetByteCount(a1)

Is there any way to get the size in bytes of a string in Java?

You probably use about the following to read the file

FileInputStream fis = new FileInputStream(path);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line;
while ((line = br.readLine()) != null) {
/* process line */
/* report percentage */
}

You need to specify the encoding already at the beginning. If you don't, you should get UTF-8 on Android. It is the default but that can be changed. I would assume that no device does that though.

To repeat what the other answers already stated: The character count is not always the same as the byte count. Especially the UTF encodings are tricky. There are currently 249,764 assigned Unicode characters and potentially over a million (WP) and UTF uses 1 to 4 byte to be able to encode all of them. UTF-32 is the simplest case since it will always use 4 bytes. UTF-8 does that dynamically and uses 1 to 4 bytes. Simple ASCII characters use just 1 byte. (source: UTF & BOM FAQ)

To get the amount of bytes you can use e.g. line.getBytes("UTF-8").length(). One big disadvantage is that this is very inefficient since it creates copy of the String internal array each time and throws it away after that. That is #1 addressed at Android | Performance Tips

It is also not 100% accurate in terms of actual bytes read from the file for following reasons:

  • UTF-16 textfiles for example often start with a special 2 byte BOM (Byte Order Mark) to signal whether they have to interpreted little or big endian. Those 2 (UTF-8: 3, UTF-32: 4) bytes are not reported when you just look at the String you get from your reader. So you are already some bytes off here.

  • Turning every line of a file into an UTF-16 String will include those BOM bytes for each line. So getBytes will report 2 bytes too much for each line.

  • Line ending characters are not part of the resulting line-String. To make things worse you have different ways of signaling the end of a line. Usually the Unix-Style '\n' which is only 1 character or the Windows-Style '\r''\n' which is two characters. The BufferedReader will simply skip those. Here your calculation is missing a very variable amount of bytes. From 1 byte for Unix/UTF-8 to 8 bytes for Windows/UTF-32.

The last two reasons would negate each other if you have Unix/UTF-16, but that is probably not the typical case. The effect of the error also depends on line length: if you have an error of 4 byte for each line that is in total only 10 bytes long your progress will be quite considerably wrong (if my math is good your progress would be at 140% or 60% when after the last line, depending on whether your calculation assumes -4 or +4 byte per line)

That means so far that regardless of what you do, you get no more than an approximation.

Getting the actual byte-count could probably be done if you write your own special byte counting Reader but that would be quite a lot of work.

An alternative would be to use a custom InputStream that counts how much bytes are actually read from the underlying stream. That's not too hard to do and it does not care for encodings.

The big disadvantage is that it does not increase linearly with the lines you read since BufferedReader will fill it's internal buffer and read lines from there, then read the next chunk from the file and so on. If the buffer is large enough you are at 100% at the first line already. But I assume your files are big enough or you would not want to find out about the progress.

This for example would be such an implementation. It works but I can't guarantee that it is perfect. It won't work if streams use mark() and reset(). File reading should no do that though.

static class CountingInputStream extends FilterInputStream {
private long bytesRead;

protected CountingInputStream(InputStream in) {
super(in);
}

@Override
public int read() throws IOException {
int result = super.read();
if (result != -1) bytesRead += 1;
return result;
}
@Override
public int read(byte[] b) throws IOException {
int result = super.read(b);
if (result != -1) bytesRead += result;
return result;
}
@Override
public int read(byte[] b, int off, int len) throws IOException {
int result = super.read(b, off, len);
if (result != -1) bytesRead += result;
return result;
}
@Override
public long skip(long n) throws IOException {
long result = super.skip(n);
if (result != -1) bytesRead += result;
return result;
}

public long getBytesRead() {
return bytesRead;
}
}

Using the following code

File file = new File("mytestfile.txt");
int linesRead = 0;
long progress = 0;
long fileLength = file.length();
String line;

CountingInputStream cis = new CountingInputStream(new FileInputStream(file));
BufferedReader br = new BufferedReader(new InputStreamReader(cis, "UTF-8"), 8192);
while ((line = br.readLine()) != null) {
long newProgress = cis.getBytesRead();
if (progress != newProgress) {
progress = newProgress;
int percent = (int) ((progress * 100) / fileLength);
System.out.println(String.format("At line: %4d, bytes: %6d = %3d%%", linesRead, progress, percent));
}
linesRead++;
}
System.out.println("Total lines: " + linesRead);
System.out.println("Total bytes: " + fileLength);
br.close();

I get output like

At line:    0, bytes:   8192 =   5%
At line: 82, bytes: 16384 = 10%
At line: 178, bytes: 24576 = 15%
....
At line: 1621, bytes: 155648 = 97%
At line: 1687, bytes: 159805 = 100%
Total lines: 1756
Total bytes: 159805

or in case of the same file UTF-16 encoded

At line:    0, bytes:  24576 =   7%
At line: 82, bytes: 40960 = 12%
At line: 178, bytes: 57344 = 17%
.....
At line: 1529, bytes: 303104 = 94%
At line: 1621, bytes: 319488 = 99%
At line: 1687, bytes: 319612 = 100%
Total lines: 1756
Total bytes: 319612

Instead of printing that you could update your progress.

So, what is the best approach?

  • If you know that you have simple ASCII text in an encoding that uses only 1 byte for those characters: just use String#length() (and maybe add +1 or +2 for the line ending)
    String#length() is fast and simple and as long as you know what files you have you should have no problems.
  • If your have international text where the simple approach won't work:
    • for smaller files where processing each line takes rather long: String#getBytes(), the longer processing 1 line takes the lower the impact of temporary arrays and their garbage collection. The inaccuracy should be within acceptable bounds. Just make sure not to crash if progress > 100% or < 100% at the end.
    • for larger files above approach. The larger the file the better. Updating progress in 0.001% steps is just slowing down things. Decreasing the reader's buffer size would increases the accuracy but it also decreases the read performance.
  • If you have enough time: write your own Reader that tells you the exact byte position. Maybe a combination of InputStreamReader and BufferedReader since Reader already operates on characters. Android's implementation may help as starting point.

Calculate the size in bytes of a Swift String

It all depends on the character encoding, let's suppose UTF8:

let string = "abde"
let size = string.utf8.count

Note that not all characters have the same byte size in UTF8.
If your string is ASCII, you can assume 1 byte per character.



Related Topics



Leave a reply



Submit