How to know the size of the string in bytes?
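You can use the System.Text.Encoding class to count the bytes a string occupies in a given encoding. With ASCII, each character takes one byte (characters outside ASCII are replaced with '?'); with Unicode (UTF-16), each char takes two bytes. Pass your string as the argument:
System.Text.Encoding.Unicode.GetByteCount(s);
System.Text.Encoding.ASCII.GetByteCount(s);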
Python : Get size of string in bytes
If you want the number of bytes in a string, this function should do it for you pretty solidly.
def utf8len(s):
    return len(s.encode('utf-8'))
The reason you got weird numbers is that strings are actual objects in Python. A measurement of the object's memory footprint (for example with sys.getsizeof) includes the object header (reference count, type pointer, length, cached hash and so on) on top of the character data, hence the higher than expected byte count.
Note that methods such as 'encode' are not stored inside each string; they live on the str type. Encoding the string and taking the length of the result counts only the bytes of the text itself.
How to get the string size in bytes?
You can use strlen. It counts the bytes up to, but not including, the terminating null character, so the string you pass must be properly null-terminated.
If you want the size of the memory buffer that contains your string, and you have a pointer to it:
- If it is a dynamic array (created with malloc), it is impossible to get its size from the pointer, since the compiler doesn't know what the pointer is pointing at; sizeof would only give you the size of the pointer itself.
- If it is a static array, you can use sizeof on the array itself to get its size.
How to count String bytes properly?
The word endereço should return me length 9 instead of 8.
If you expect a size of 9 bytes for the String "endereço", which has a length of 8 characters (7 ASCII characters and 1 non-ASCII character), I suppose that you want to use the UTF-8 charset, which uses 1 byte for characters included in the ASCII table and more for the others.
But neither the String length method nor the length of the byte array returned from the getBytes method counts special chars as two bytes.
The String length() method doesn't answer the question "how many bytes are used?" but rather "how many UTF-16 code units (or, more simply, chars) does the string contain?"
String length() Javadoc :
Returns the length of this string. The length is equal to the number of Unicode code units in the string.
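To make the difference between code units and code points concrete, here is a minimal sketch. The class and method names are hypothetical, chosen only for this illustration; it uses the standard String.length() and String.codePointCount() methods.

```java
public class LengthVsCodePoints {
    // What String.length() reports: the number of UTF-16 code units (chars).
    public static int codeUnits(String s) {
        return s.length();
    }

    // The number of Unicode code points, which can be lower than length().
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String word = "endere\u00E7o";   // "endereço", all characters in the BMP
        String emoji = "\uD83D\uDE00";   // U+1F600, outside the BMP (surrogate pair)
        System.out.println(codeUnits(word) + " code units, " + codePoints(word) + " code points");   // 8 and 8
        System.out.println(codeUnits(emoji) + " code units, " + codePoints(emoji) + " code points"); // 2 and 1
    }
}
```

Neither number is a byte count; both are independent of any charset.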
The byte[] getBytes() method with no argument encodes the String into a byte array. You could use the length field of the returned array to know how many bytes are used by the encoded String, but the result depends on the charset used during the encoding.
But the byte[] getBytes() method doesn't let you specify the charset: it uses the platform's default charset.
So it may not give the expected result if the default charset of the underlying OS is not the one you want to use to encode your Strings into bytes.
Besides, the way the Strings are encoded into bytes may change depending on the platform where the application is deployed, which may be undesirable.
Finally, if the String cannot be encoded in the default charset, the behavior is unspecified.
So this method should be used with great caution or not at all.
byte[] getBytes() Javadoc :
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array. The behavior of this method when this string cannot be encoded in the default charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
In your String example "endereço", if getBytes() returns an array with a size of 8 and not 9, it means that your OS doesn't use UTF-8 by default but a charset with a fixed width of 1 byte per character, such as ISO 8859-1 or one of its derived charsets such as windows-1252 on Windows.
To know the default charset of the Java virtual machine where the application runs, you can use this utility method: Charset defaultCharset = Charset.defaultCharset().
Solution
The byte[] getBytes() method comes with two other very useful overloads:
byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException
byte[] java.lang.String.getBytes(Charset charset)
Contrary to the getBytes() method with no argument, these methods let you specify the charset to use during the byte encoding.
byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException
Javadoc :
Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array. The behavior of this method when this string cannot be encoded in the given charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
byte[] java.lang.String.getBytes(Charset charset)
Javadoc :
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array. This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
You may use either one (there are some subtle differences between them) to encode your String into a byte array with UTF-8 or any other charset, and so get its size for that specific charset.
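One of those differences is what happens to characters the charset cannot represent. The following is a small sketch, with hypothetical class and method names, showing that getBytes(Charset) silently substitutes the replacement byte, and that a CharsetEncoder can tell you in advance whether a String is encodable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UnmappableDemo {
    // True if every character of s can be represented in the given charset.
    public static boolean canEncode(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    // Size in bytes of s once encoded with the given charset.
    public static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String word = "endere\u00E7o"; // "endereço"
        // US-ASCII cannot represent 'ç': it is replaced by a single '?' byte.
        System.out.println(canEncode(word, StandardCharsets.US_ASCII));     // false
        System.out.println(encodedLength(word, StandardCharsets.US_ASCII)); // 8
        // UTF-8 can represent it, using 2 bytes for 'ç'.
        System.out.println(canEncode(word, StandardCharsets.UTF_8));        // true
        System.out.println(encodedLength(word, StandardCharsets.UTF_8));    // 9
    }
}
```

So a byte count obtained with a charset that cannot encode the text is the size of a lossy encoding, not of the original text.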
For example, to get a UTF-8 encoded byte array by using getBytes(String charsetName), you can do this:
String yourString = "endereço";
byte[] bytes = yourString.getBytes("UTF-8");
int sizeInBytes = bytes.length;
And you will get a length of 9 bytes as you wish.
Here is a more comprehensive example that displays the default charset and the encoded size with the platform default charset, UTF-8 and UTF-16:
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// The class name is arbitrary; any class can host this main method.
public class StringSizeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // default charset
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset = " + defaultCharset);
        // String sample
        String yourString = "endereço";
        // getBytes() with default platform encoding
        System.out.println("getBytes() with default charset, size = " + yourString.getBytes().length + System.lineSeparator());
        // getBytes() with specific charset UTF-8
        System.out.println("getBytes(\"UTF-8\"), size = " + yourString.getBytes("UTF-8").length);
        System.out.println("getBytes(StandardCharsets.UTF_8), size = " + yourString.getBytes(StandardCharsets.UTF_8).length + System.lineSeparator());
        // getBytes() with specific charset UTF-16
        System.out.println("getBytes(\"UTF-16\"), size = " + yourString.getBytes("UTF-16").length);
        System.out.println("getBytes(StandardCharsets.UTF_16), size = " + yourString.getBytes(StandardCharsets.UTF_16).length);
    }
}
Output on my machine that is Windows OS based:
default charset = windows-1252
getBytes() with default charset, size = 8
getBytes("UTF-8"), size = 9
getBytes(StandardCharsets.UTF_8), size = 9
getBytes("UTF-16"), size = 18
getBytes(StandardCharsets.UTF_16), size = 18
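The UTF-16 size of 18 rather than 16 comes from the byte order mark: Java's "UTF-16" charset writes a 2-byte big-endian BOM before the 8 two-byte code units. A short sketch (class and method names are hypothetical) comparing it with the BOM-free UTF-16BE and UTF-16LE charsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    // Size in bytes of s once encoded with the given charset.
    public static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String word = "endere\u00E7o"; // "endereço", 8 UTF-16 code units
        byte[] withBom = word.getBytes(StandardCharsets.UTF_16);
        // The first two bytes are the BOM 0xFE 0xFF.
        System.out.printf("first bytes: %02X %02X%n", withBom[0] & 0xFF, withBom[1] & 0xFF);
        System.out.println("UTF-16   = " + withBom.length);                                   // 18
        System.out.println("UTF-16BE = " + encodedLength(word, StandardCharsets.UTF_16BE));  // 16
        System.out.println("UTF-16LE = " + encodedLength(word, StandardCharsets.UTF_16LE));  // 16
    }
}
```

If you want the size of the text alone in UTF-16, use one of the explicit byte-order charsets.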
Get the size of a string
Refer to this link: How to know the size of the string in bytes?
System.Text.Encoding.Unicode.GetByteCount(s);
System.Text.Encoding.ASCII.GetByteCount(s);
or from msdn: http://msdn.microsoft.com/en-us/library/system.string.aspx
How to get string's byte length?
From the MSDN:
The .NET Framework uses the UTF-16 encoding (represented by the UnicodeEncoding class) to represent characters and strings.
So a1.Length is in UTF-16 code units (What's the difference between a character, a code point, a glyph and a grapheme?). Cyrillic characters, being in the BMP (Basic Multilingual Plane), all use a single code unit (so a single char). Many emoji, for example, use TWO code units (two chars, 4 bytes!) because they aren't in the BMP. See for example https://ideone.com/ASDORp.
If you want the size IN BYTES, a1.Length * 2 is the in-memory size in UTF-16 :-) If you want to know how many bytes it would be in UTF-8 (a very common encoding, NOT USED INTERNALLY BY .NET, but widely used by the web, XML, ...), use Encoding.UTF8.GetByteCount(a1).
Is there any way to get the size in bytes of a string in Java?
You probably use something like the following to read the file
FileInputStream fis = new FileInputStream(path);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line;
while ((line = br.readLine()) != null) {
    /* process line */
    /* report percentage */
}
You need to specify the encoding already at the beginning. If you don't, you should get UTF-8 on Android. It is the default but that can be changed. I would assume that no device does that though.
To repeat what the other answers already stated: the character count is not always the same as the byte count. The UTF encodings are especially tricky. There are currently 249,764 assigned Unicode characters and potentially over a million (WP), and the UTF encodings use 1 to 4 bytes to be able to encode all of them. UTF-32 is the simplest case since it always uses 4 bytes. UTF-8 does that dynamically and uses 1 to 4 bytes. Simple ASCII characters use just 1 byte. (source: UTF & BOM FAQ)
To get the number of bytes you can use e.g. line.getBytes("UTF-8").length (length is a field on arrays, not a method). One big disadvantage is that this is very inefficient, since it creates an encoded copy of the String's internal array each time and throws it away afterwards. That is point #1 addressed in Android | Performance Tips
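If the temporary arrays are a concern, the UTF-8 size can be computed by walking the chars instead of encoding them. This is only a sketch under stated assumptions: the class and method names are made up for this example, and the input is assumed to be well-formed UTF-16 (for an unpaired surrogate this counts 3 bytes, whereas getBytes would emit a 1-byte replacement).

```java
public class Utf8Length {
    // Counts the bytes s would occupy in UTF-8 without allocating a byte array.
    // Assumption: s contains no unpaired surrogates.
    public static long utf8Length(CharSequence s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // e.g. Latin letters with accents
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // surrogate pair: one code point outside the BMP
                i++;                              // skip the low surrogate
            } else {
                bytes += 3;                       // rest of the BMP
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("endere\u00E7o")); // 9
        System.out.println(utf8Length("\uD83D\uDE00"));  // 4
    }
}
```

For well-formed input the result matches line.getBytes("UTF-8").length, so it can be swapped in inside the read loop.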
It is also not 100% accurate in terms of actual bytes read from the file, for the following reasons:
- UTF-16 text files, for example, often start with a special 2-byte BOM (Byte Order Mark) that signals whether they have to be interpreted as little or big endian. Those 2 (UTF-8: 3, UTF-32: 4) bytes are not reported when you just look at the String you get from your reader. So you are already some bytes off here.
- Turning every line of a file back into UTF-16 bytes will include a BOM for each line. So getBytes will report 2 bytes too many for each line.
- Line ending characters are not part of the resulting line String. To make things worse, there are different ways of signaling the end of a line: usually the Unix-style '\n', which is only 1 character, or the Windows-style '\r' '\n', which is two characters. The BufferedReader will simply skip those. Here your calculation is missing a varying number of bytes: from 1 byte for Unix/UTF-8 to 8 bytes for Windows/UTF-32.
The last two reasons would negate each other if you have Unix/UTF-16, but that is probably not the typical case. The effect of the error also depends on line length: if you have an error of 4 bytes for each line that is only 10 bytes long in total, your progress will be considerably wrong (if my math is good, your progress would be at 140% or 60% after the last line, depending on whether your calculation is off by +4 or -4 bytes per line).
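The drift caused by dropped line endings is easy to reproduce with a small in-memory text. The following sketch uses hypothetical names and a StringReader in place of a real file; it sums the UTF-8 size of each line returned by readLine() and compares that with the size of the whole text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LineByteDrift {
    // Sums the UTF-8 size of every line as returned by readLine(),
    // which strips the line terminators.
    public static int sumOfLineBytes(String text) {
        int total = 0;
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                total += line.getBytes(StandardCharsets.UTF_8).length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        return total;
    }

    public static void main(String[] args) {
        String text = "ab\r\ncd\r\n"; // two short lines with Windows line endings
        int actual = text.getBytes(StandardCharsets.UTF_8).length; // 8
        int counted = sumOfLineBytes(text);                        // 4
        System.out.println("counted " + counted + " of " + actual
                + " bytes = " + (counted * 100 / actual) + "%");   // 50%
    }
}
```

With lines this short the progress would stop at 50%; with longer lines the relative error shrinks accordingly.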
That means so far that, regardless of what you do, you get no more than an approximation.
Getting the actual byte count could probably be done if you write your own special byte-counting Reader, but that would be quite a lot of work.
An alternative would be to use a custom InputStream that counts how many bytes are actually read from the underlying stream. That's not too hard to do and it does not care about encodings.
The big disadvantage is that it does not increase linearly with the lines you read, since BufferedReader will fill its internal buffer and read lines from there, then read the next chunk from the file and so on. If the buffer is large enough you are at 100% at the first line already. But I assume your files are big enough or you would not want to find out about the progress.
This, for example, would be such an implementation. It works but I can't guarantee that it is perfect. It won't work if streams use mark() and reset(). File reading should not do that though.
// Requires java.io.FilterInputStream, java.io.InputStream and java.io.IOException.
static class CountingInputStream extends FilterInputStream {
    private long bytesRead;

    protected CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int result = super.read();
        if (result != -1) bytesRead += 1;
        return result;
    }

    // read(byte[] b) is deliberately not overridden: FilterInputStream implements it
    // by calling read(b, 0, b.length), so counting in both would count twice.
    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int result = super.read(b, off, len);
        if (result != -1) bytesRead += result;
        return result;
    }

    @Override
    public long skip(long n) throws IOException {
        long result = super.skip(n);
        if (result > 0) bytesRead += result;
        return result;
    }

    public long getBytesRead() {
        return bytesRead;
    }
}
Using the following code
File file = new File("mytestfile.txt");
int linesRead = 0;
long progress = 0;
long fileLength = file.length();
String line;
CountingInputStream cis = new CountingInputStream(new FileInputStream(file));
BufferedReader br = new BufferedReader(new InputStreamReader(cis, "UTF-8"), 8192);
while ((line = br.readLine()) != null) {
    long newProgress = cis.getBytesRead();
    if (progress != newProgress) {
        progress = newProgress;
        int percent = (int) ((progress * 100) / fileLength);
        System.out.println(String.format("At line: %4d, bytes: %6d = %3d%%", linesRead, progress, percent));
    }
    linesRead++;
}
System.out.println("Total lines: " + linesRead);
System.out.println("Total bytes: " + fileLength);
br.close();
I get output like
At line: 0, bytes: 8192 = 5%
At line: 82, bytes: 16384 = 10%
At line: 178, bytes: 24576 = 15%
....
At line: 1621, bytes: 155648 = 97%
At line: 1687, bytes: 159805 = 100%
Total lines: 1756
Total bytes: 159805
or in case of the same file UTF-16 encoded
At line: 0, bytes: 24576 = 7%
At line: 82, bytes: 40960 = 12%
At line: 178, bytes: 57344 = 17%
.....
At line: 1529, bytes: 303104 = 94%
At line: 1621, bytes: 319488 = 99%
At line: 1687, bytes: 319612 = 100%
Total lines: 1756
Total bytes: 319612
Instead of printing that you could update your progress.
So, what is the best approach?
- If you know that you have simple ASCII text in an encoding that uses only 1 byte for those characters: just use String#length() (and maybe add +1 or +2 for the line ending). String#length() is fast and simple, and as long as you know what files you have you should have no problems.
- If you have international text where the simple approach won't work:
  - for smaller files where processing each line takes rather long: String#getBytes(). The longer processing 1 line takes, the lower the impact of temporary arrays and their garbage collection. The inaccuracy should be within acceptable bounds. Just make sure not to crash if progress is > 100% or < 100% at the end.
  - for larger files: the counting InputStream approach above. The larger the file the better. Updating progress in 0.001% steps just slows things down. Decreasing the reader's buffer size would increase the accuracy but it also decreases the read performance.
- If you have enough time: write your own Reader that tells you the exact byte position. Maybe a combination of InputStreamReader and BufferedReader, since Reader already operates on characters. Android's implementation may help as a starting point.
Calculate the size in bytes of a Swift String
It all depends on the character encoding, let's suppose UTF8:
let string = "abde"
let size = string.utf8.count
Note that not all characters have the same byte size in UTF8.
If your string is ASCII, you can assume 1 byte per character.