strlen() and UTF-8 Encoding

strlen() and UTF-8 encoding

The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, upside-down question mark, one half fraction, digit two).

If strlen() were called on a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).

However, if we were to store that string as ISO 8859-1 or CP1252, we would have a six-byte sequence that would be legal as UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode Replacement Character, digit two). That is, the UTF-8 encoding of the single character '�' is identical to the ISO 8859-1 encoding of the three characters "ï¿½".

The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.

It appears that the original string was processed through multiple layers of misinterpretation: first by the use of a UTF-8 decoder on non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1ï¿½2).
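The byte counts above can be checked with a small C sketch. The bytes are written as hex escapes so the result does not depend on the encoding of the source file; the names utf8_text, latin1_text and print_lengths are illustrative only and do not come from the question.

```c
#include <stdio.h>
#include <string.h>

/* "$1ï¿½2" encoded as UTF-8: ï, ¿ and ½ take two bytes each.
   The literal is split before "2" so the digit is not read as part
   of the preceding hex escape. */
const char utf8_text[] = "$1\xC3\xAF\xC2\xBF\xC2\xBD" "2";

/* The same six characters encoded as ISO 8859-1: one byte each.
   The middle three bytes, EF BF BD, are also the UTF-8 encoding of
   U+FFFD, the replacement character. */
const char latin1_text[] = "$1\xEF\xBF\xBD" "2";

/* strlen() counts bytes up to the terminating '\0', whatever the encoding. */
void print_lengths(void) {
    printf("UTF-8 bytes:      %zu\n", strlen(utf8_text));
    printf("ISO 8859-1 bytes: %zu\n", strlen(latin1_text));
}
```

Calling print_lengths() should report nine bytes for the UTF-8 form and six for the ISO 8859-1 form.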

Getting the string length on UTF-8 in C?

What do you mean by string length?

The UTF-8 encoding is very well designed and compatible with the definition of C strings: UTF-8 strings are just null-terminated arrays of bytes, like ASCII strings.

The number of bytes is easily obtained with strlen(s). If for some reason you cannot use strlen, it is easy to emulate and the algorithm is exactly what you propose in the question:

#include <stddef.h>

size_t string_length(const char *s) {
    size_t length = 0;
    while (*s++ != '\0')
        length++;
    return length;
}

The number of code points encoded in UTF-8 can be computed by counting the number of single byte chars (range 1 to 127) and the number of leading bytes (range 0xC0 to 0xFF), ignoring continuation bytes (range 0x80 to 0xBF) and stopping at '\0'.

Here is a simple function to do this:

size_t count_utf8_code_points(const char *s) {
    size_t count = 0;
    while (*s) {
        count += (*s++ & 0xC0) != 0x80;
    }
    return count;
}

This function assumes that the contents of the array pointed to by s is properly encoded.

Also note that this will compute the number of code points, not the number of characters displayed, as some of these may be encoded using multiple combining code points, such as <LATIN CAPITAL LETTER A> followed by <COMBINING ACUTE ACCENT>.
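To make the difference between bytes, code points and displayed characters concrete, here is a short sketch that repeats the counting function and applies it to both spellings of an accented capital A. The array names are made up for this example, and the bytes are given as hex escapes to avoid depending on the source file's encoding.

```c
#include <stddef.h>
#include <string.h>

/* Same rule as above: count every byte that is not a continuation
   byte (10xxxxxx). Assumes well-formed UTF-8. */
size_t count_utf8_code_points(const char *s) {
    size_t count = 0;
    while (*s) {
        count += (*s++ & 0xC0) != 0x80;
    }
    return count;
}

/* U+0041 LATIN CAPITAL LETTER A followed by U+0301 COMBINING ACUTE ACCENT:
   displayed as one character, stored as two code points in three bytes. */
const char decomposed_a_acute[] = "A\xCC\x81";

/* U+00C1 LATIN CAPITAL LETTER A WITH ACUTE:
   displayed as one character, stored as one code point in two bytes. */
const char precomposed_a_acute[] = "\xC3\x81";
```

Both strings look the same on screen, yet strlen() gives 3 and 2, and count_utf8_code_points() gives 2 and 1.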

UTF-8, sprintf, strlen, etc

I would prefer to avoid wide characters solution...

Wide characters are just not enough: if you need 4 bytes for a single glyph, then that glyph is likely to be outside the Basic Multilingual Plane, and it will not be represented by a single 16-bit wchar_t (assuming wchar_t is 16 bits wide, as it is on Windows; on many other platforms it is 32 bits).
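As a rough illustration of the point about 16-bit units, the sketch below decodes a single UTF-8 sequence into a code point and reports how many UTF-16 code units that code point needs. Both function names are hypothetical, and the decoder assumes its input is well-formed; it does no validation.

```c
#include <stdint.h>

/* Decode the code point whose first byte is s[0]. Assumes s points at
   the start of a well-formed UTF-8 sequence; nothing is validated. */
uint32_t decode_utf8_code_point(const unsigned char *s) {
    if (s[0] < 0x80) {
        return s[0];                                  /* 1 byte  */
    }
    if ((s[0] & 0xE0) == 0xC0) {                      /* 2 bytes */
        return ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
    }
    if ((s[0] & 0xF0) == 0xE0) {                      /* 3 bytes */
        return ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6)
             | (uint32_t)(s[2] & 0x3F);
    }
    return ((uint32_t)(s[0] & 0x07) << 18)            /* 4 bytes */
         | ((uint32_t)(s[1] & 0x3F) << 12)
         | ((uint32_t)(s[2] & 0x3F) << 6)
         | (uint32_t)(s[3] & 0x3F);
}

/* Code points above U+FFFF lie outside the Basic Multilingual Plane
   and need a surrogate pair (two 16-bit units) in UTF-16. */
int utf16_units_needed(uint32_t code_point) {
    return code_point > 0xFFFF ? 2 : 1;
}
```

For example, the four bytes F0 9F 98 80 decode to U+1F600, which needs two 16-bit units, while the three bytes ED 98 84 decode to U+D604, which fits in one.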

You will have to use a true Unicode library to convert the input to a list of Unicode characters in Normal Form C (canonical composition) or its compatibility equivalent (NFKC)(*), depending on whether, for example, you want to count one or two characters for the ligature ﬀ (U+FB00). AFAIK, your best bet should be ICU.


(*) Unicode allows multiple representations for the same glyph, notably the normal composed form (NFC) and normal decomposed form (NFD). For example, the French é character can be represented in NFC as U+00E9 (LATIN SMALL LETTER E WITH ACUTE) or in NFD as U+0065 U+0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT, also displayed as é).

References and other examples on Unicode equivalence

utf8 string length

The core PHP string functions all assume 1 character = 1 byte. They have no concept of different encodings. To figure out how many characters are in a UTF-8 string (not how many bytes), use the mb_strlen equivalent and tell it what encoding the string is in:

echo mb_strlen('سلام', 'UTF-8');

Same encoding (UTF-8), but different lengths of string and content (PHP)

So you have 2 strings:

313420d0b8d18ed0bdd18f - this uses the 0x20 character as a space.

3134c2a0d0b8d18ed0bdd18f - this uses the byte sequence 0xC2 0xA0 as a space (it's Unicode's non-breaking space, U+00A0).

Apart from those spaces the strings are identical.

To replace space-like Unicode characters with a regular space, you can use the following regular expression:

preg_replace('~\p{Zs}~u', ' ', $str)

References:

  • PHP - Unicode character properties
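To show what the replacement does at the byte level, here is a minimal C sketch that rewrites the two-byte sequence C2 A0 as a single 0x20 byte. It is deliberately narrower than the regular expression above: it only handles U+00A0, whereas \p{Zs} matches other space separators as well. The function name replace_nbsp_with_space is invented for this example.

```c
#include <stddef.h>

/* Replace every UTF-8 encoded U+00A0 (bytes C2 A0) in s with an ASCII
   space (20), compacting the string in place. Returns the new length
   in bytes. Assumes s is a well-formed, null-terminated UTF-8 string. */
size_t replace_nbsp_with_space(char *s) {
    unsigned char *read = (unsigned char *)s;
    unsigned char *write = (unsigned char *)s;

    while (*read) {
        if (read[0] == 0xC2 && read[1] == 0xA0) {
            *write++ = ' ';     /* two bytes in, one byte out */
            read += 2;
        } else {
            *write++ = *read++; /* copy every other byte unchanged */
        }
    }
    *write = '\0';
    return (size_t)(write - (unsigned char *)s);
}
```

Applied to the second string (12 bytes), this yields the first one (11 bytes).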

how get length of UTF-8 strings without mb-strlen?

Well, then you have to write it yourself.

UTF-8

In short, UTF-8 is encoded as follows:

  • If the leftmost bit of a certain byte is a 0, then it is a single-byte character.
  • If the leftmost bit of a certain byte is a 1, then it is part of a multibyte character.
    • If the byte starts with two or more 1-bits followed by a 0-bit, it is the first byte of a multibyte character, and the number of leading 1-bits is the number of bytes the character occupies.
    • Otherwise the byte starts with the bits 10 and is a continuation byte. All the remaining bytes of a multibyte character have this form.

See here for more info.
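The rules above can be sketched as a small C function that inspects the leading bits of one byte. The name utf8_sequence_length and its return convention are assumptions made for this example. It only looks at the bit pattern, so it does not reject lead bytes such as 0xC0 or 0xF5 that never occur in valid UTF-8.

```c
/* Classify one byte of UTF-8 by its leading bits.
   Returns 1 to 4 for a byte that starts a character (the value is the
   total length of that character in bytes), 0 for a continuation byte,
   and -1 if the leading bits match no UTF-8 pattern. */
int utf8_sequence_length(unsigned char byte) {
    if ((byte & 0x80) == 0x00) return 1;  /* 0xxxxxxx: single byte      */
    if ((byte & 0xC0) == 0x80) return 0;  /* 10xxxxxx: continuation     */
    if ((byte & 0xE0) == 0xC0) return 2;  /* 110xxxxx: starts 2 bytes   */
    if ((byte & 0xF0) == 0xE0) return 3;  /* 1110xxxx: starts 3 bytes   */
    if ((byte & 0xF8) == 0xF0) return 4;  /* 11110xxx: starts 4 bytes   */
    return -1;                            /* 11111xxx: not used         */
}
```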

For example, suppose we have the following string:

Hëllo현World
01001000 ═ H  --> Starts with 0, so it's a single-byte character
11000011 ╦ ë  --> Starts with two 1s followed by 0. Char takes up 2 bytes.
         ║        This byte is the first one of the 2 bytes. The remaining 1
         ║        byte MUST start with 10.
10101011 ╝    --> This is a 'continuation' byte, and MUST start with 10.
                  Well, it does, so it's valid.
01101100 ═ l  --> This byte starts with 0, so it's a normal byte, again.
01101100 ═ l
01101111 ═ o
11101101 ╗    --> Starts with three 1-bits. So the character takes up 3 bytes.
         ║        The next 3-1=2 bytes must start with 10
10011000 ╬ 현 --> Continuation byte
10000100 ╝    --> Continuation byte
01010111 ═ W  --> Normal byte
01101111 ═ o
01110010 ═ r
01101100 ═ l
01100100 ═ d

Code

It is sufficient to just count all bytes not starting with the bits 10. In other words, count a byte if it is not in the range 128-191 inclusive.

$str = "Hëllo현World";

// ë takes up 2 bytes
// 현 takes up 3 bytes
// In a decent browser you see 11 characters (ten Latin, one Korean)

$len = 0;
$byteCount = strlen($str); // strlen() counts bytes, not characters
for ($i = 0; $i < $byteCount; $i++) {
    $byte = ord($str[$i]);
    // Count every byte except continuation bytes (128-191)
    if ($byte < 128 || $byte >= 192) {
        $len++;
    }
}

echo "Number of bytes: ".$byteCount."\n";
echo "Number of characters: ".$len;

Here is an online demo.


PS: Is there a reason you don't want to enable multibyte strings?

How do I get the length of UTF-8 string PHP?

It sounds like the input ($string) is in another encoding - probably iso-8859-1 (especially if mb_strlen() == strlen()).

If $string has come from a form input, you need to ensure that the form is posting in UTF-8. Unless specified otherwise, the default is often ISO-8859-1.

In decent browsers, this is done with the accept-charset attribute:

<form action="form.php" method="POST" accept-charset="utf-8">

Strlen not returning the correct string length

The character â is not an ASCII character: in UTF-8 it is encoded as two bytes. strlen() counts bytes, not characters, which is why it is giving 30 and not 29.

To fix this, you need to use mb_strlen() and pass the encoding:

$myString = 'Câmara de Dirigentes Lojistas';

echo mb_strlen($myString, 'UTF-8');

NOTE: If mb_strlen is undefined, you will have to enable the mbstring extension in your PHP settings.

PHP strlen and mb_strlen not working as expected

Double-check whether your text really is UTF-8 or not. That "Â" character makes it look like a classic character encoding problem to me. You should check the entire path from the origin of the text through the point in your code that you quoted above, because there are a lot of places where the encodings can get munged.
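One way to run that check mechanically is to test whether the bytes are well-formed UTF-8 at all. The following is a minimal sketch in C under that assumption; the function name is_valid_utf8 is made up for this example, and a production system would more likely rely on a library. Text in ISO-8859-1 that contains accented letters will usually fail this test, because a lone byte such as 0xE2 (â) is not followed by the continuation bytes UTF-8 requires.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if the len bytes at data form well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *data, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = data[i];
        size_t extra;       /* continuation bytes expected after b */
        unsigned long cp;   /* code point being assembled */
        unsigned long min;  /* smallest code point allowed for this length */

        if (b < 0x80)                { extra = 0; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;  /* stray continuation byte or invalid lead byte */

        if (extra > len - i - 1) return false;           /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            if ((data[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (unsigned long)(data[i + k] & 0x3F);
        }
        if (cp < min) return false;                      /* overlong encoding */
        if (cp > 0x10FFFF) return false;                 /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* UTF-16 surrogate */
        i += extra + 1;
    }
    return true;
}
```

A passing result does not prove the text was meant as UTF-8 (pure ASCII passes in any of these encodings), but a failing result is a strong sign that the data is in another encoding or was damaged along the way.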

Did the text originate from an HTML form? Ensure your <form> element includes the accept-charset="UTF-8" attribute.

Did the text get stored in a database along the way? Make sure the database stores and returns the data in UTF-8. This means checking the server's global defaults, the defaults for the database or schema, and the table itself.


