How to Use String Methods on UTF-8 Characters

Accessing chars in UTF-8 strings

So would the only way to randomly access that character be a linear search? (Start from the beginning and, for each char, check its length, skipping over that many bytes until I reach the correct character index.)

Yes, exactly.
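In Java terms, that linear scan looks like the sketch below: read each lead byte, derive the sequence length from it, and skip ahead. The class and method names (Utf8Scan, byteOffsetOfChar) are made up for illustration, and the code assumes well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Scan {
    // Returns the byte offset where the n-th (0-based) codepoint starts,
    // assuming the input is valid UTF-8.
    static int byteOffsetOfChar(byte[] utf8, int n) {
        int offset = 0;
        while (n-- > 0) {
            int lead = utf8[offset] & 0xFF;
            if (lead < 0x80)      offset += 1; // ASCII, single byte
            else if (lead < 0xE0) offset += 2; // lead 110xxxxx: 2-byte sequence
            else if (lead < 0xF0) offset += 3; // lead 1110xxxx: 3-byte sequence
            else                  offset += 4; // lead 11110xxx: 4-byte sequence
        }
        return offset;
    }

    public static void main(String[] args) {
        byte[] utf8 = "aßc".getBytes(StandardCharsets.UTF_8); // 1 + 2 + 1 bytes
        System.out.println(byteOffsetOfChar(utf8, 2)); // 'c' starts at byte 3
    }
}
```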

If so, why does everyone want to store files in utf-8?

UTF-8 is more portable than UTF-16 or UTF-32 (UTF-8 has no endian issues), and is backwards compatible with ASCII, so it won't break the large majority of legacy apps. Also, UTF-8 is more compact in byte size than UTF-16 for Unicode codepoints U+0000 - U+007F, and is the same byte size as UTF-16 for codepoints U+0080 - U+07FF. So UTF-8 tends to be a better choice for handling the majority of the world's commonly used English/Latin-based languages. However, once you start dealing with Unicode codepoints above U+07FF (Asian languages, symbols, emojis, etc), UTF-16 usually becomes more compact than UTF-8.

UTF-16 tends to be easier to work with when processing data, since it only deals with 1 codeunit for codepoints U+0000 - U+FFFF, compared to UTF-8's use of 1-3 codeunits for the same codepoints. UTF-16 uses 2 codeunits for the remaining codepoints, compared to UTF-8's use of 4 codeunits for the same codepoints.

But even then, UTF-16 is technically a variable-length encoding, so you can't really use random access with it, either. True random access is possible in UTF-8 only if the data contains codepoints U+0000 - U+007F and nothing higher, and is possible in UTF-16 only if the data contains codepoints U+0000 - U+FFFF and nothing higher. Anything else requires linear scanning. However, scanning through UTF-16 is easier than scanning through UTF-8 since fewer codeunits are involved. And UTF-16 is designed to easily detect leading and trailing codeunits to skip them during scanning, whereas UTF-8 does not lend itself as well to that.
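As a rough illustration of how self-describing UTF-16 code units are: in Java (whose Strings are UTF-16 internally), any code unit can be classified in isolation with Character.isLowSurrogate, so a scanner can count or skip characters without decoding them. The class name SurrogateScan is hypothetical.

```java
public class SurrogateScan {
    // Counts code points by skipping trailing (low) surrogates:
    // a low surrogate never starts a character.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLowSurrogate(s.charAt(i))) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (surrogate pair), 'b'
        System.out.println(countCodePoints(s)); // 3
    }
}
```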

Wouldn't that just make parsing and analysis hugely more expensive?

UTF-8 is better suited for storage and communications, but not necessarily easier for parsing. It depends on the languages involved. UTF-16 tends to be better suited for parsing, as long as you account for surrogate pairs.

If you don't want to handle variable-length characters, and need true random access, then use UTF-32 instead, since it uses only 1 codeunit for every possible codepoint.
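In Java, the practical equivalent of switching to UTF-32 is decoding once into an int[] of code points, which then supports true random access. A minimal sketch, with toCodePoints as a made-up helper name:

```java
public class Utf32Access {
    // Decode once into one 32-bit unit per character (a UTF-32-style array).
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        int[] cps = toCodePoints("a\uD83D\uDE00b"); // 'a', U+1F600, 'b'
        System.out.println(cps.length);                  // 3 characters, not 4 chars
        System.out.println(Integer.toHexString(cps[1])); // 1f600
    }
}
```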

In context: I'm writing a language lexer, and information all around says source files should be in UTF-8. But if I support variable-length characters, wouldn't that just complicate everything unnecessarily?

Not necessarily, especially if you are only supporting forward parsing. Even with UTF-16, you have to account for variable-length characters as well.

Would it be acceptable to just support UTF-8/ASCII with only single-byte characters in source files?

That depends on the requirements of your parser, but I would say no. Many users want to be able to embed Unicode data in their source files, and even use Unicode identifiers if possible. Even back in the ANSI days before Unicode, non-ASCII characters could be either single-byte or multi-byte depending on the charset used.

So unless you want to completely shun non-ASCII languages (which is not a good idea in today's international world), you should deal with variable-length characters in one form or another.
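For a forward-only lexer, the sketch below shows what handling variable-length characters amounts to in Java: read a code point, test it, and advance by Character.charCount. The Lexer class and scanIdentifier method are hypothetical names.

```java
public class Lexer {
    // Scans an identifier starting at char index i; returns the end index.
    static int scanIdentifier(String src, int i) {
        while (i < src.length()) {
            int cp = src.codePointAt(i);
            if (!Character.isUnicodeIdentifierPart(cp)) break;
            i += Character.charCount(cp); // advance 1 or 2 UTF-16 units, forward only
        }
        return i;
    }

    public static void main(String[] args) {
        String src = "größe = 1";
        System.out.println(src.substring(0, scanIdentifier(src, 0))); // größe
    }
}
```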

PHP string functions: which ones will work with UTF-8?

Core PHP SBCS string functions

Assuming the default encoding of PHP is set to UTF-8, these string functions will work:

  • echo Output one or more strings
  • html_entity_decode Convert all HTML entities to their applicable characters
  • htmlentities Convert all applicable characters to HTML entities
  • htmlspecialchars_decode Convert special HTML entities back to characters
  • htmlspecialchars Convert special characters to HTML entities
  • implode Join array elements with a string
  • join Alias of implode
  • nl2br Inserts HTML line breaks before all newlines in a string
  • print Output a string
  • quotemeta Quote meta characters
  • str_repeat Repeat a string
  • str_rot13 Perform the rot13 transform on a string
  • strip_tags Strip HTML and PHP tags from a string
  • stripcslashes Un-quote string quoted with addcslashes
  • stripslashes Un-quotes a quoted string

Unfortunately, all other string functions do not work with UTF-8.
Obstacles:

  • case handling and whitespace handling do not work with UTF-8
  • string lengths in parameters and return values are in bytes, not characters
  • string processing causes data corruption
  • the string function is completely ASCII-oriented

In some cases functions can work as expected when parameters are US-ASCII and
lengths are byte lengths.

Binary string functions are still useful:

  • bin2hex Convert binary data into hexadecimal representation
  • chr Return a specific character (=byte)
  • convert_uudecode Decode a uuencoded string
  • convert_uuencode Uuencode a string
  • crc32 Calculates the crc32 polynomial of a string
  • crypt One-way string hashing
  • hex2bin Decodes a hexadecimally encoded binary string
  • md5_file Calculates the md5 hash of a given file
  • md5 Calculate the md5 hash of a string
  • ord Return ASCII value of character (=byte)
  • sha1_file Calculate the sha1 hash of a file
  • sha1 Calculate the sha1 hash of a string

Configuration functions do not apply:

  • get_html_translation_table Returns the translation table used by htmlspecialchars and htmlentities
  • localeconv Get numeric formatting information
  • nl_langinfo Query language and locale information
  • setlocale Set locale information

Regular expression functions and encoding and transcoding functions are not considered.

Extensions

In quite a few cases, Multibyte String
offers a UTF-8 variant:

  • mb_convert_case Perform case folding on a string
  • mb_parse_str Parse GET/POST/COOKIE data and set global variable
  • mb_split Split multibyte string using regular expression
  • mb_strcut Get part of string
  • mb_strimwidth Get truncated string with specified width
  • mb_stripos Finds position of first occurrence of a string within another, case insensitive
  • mb_stristr Finds first occurrence of a string within another, case insensitive
  • mb_strlen Get string length
  • mb_strpos Find position of first occurrence of string in a string
  • mb_strrchr Finds the last occurrence of a character in a string within another
  • mb_strrichr Finds the last occurrence of a character in a string within another, case insensitive
  • mb_strripos Finds position of last occurrence of a string within another, case insensitive
  • mb_strrpos Find position of last occurrence of a string in a string
  • mb_strstr Finds first occurrence of a string within another
  • mb_strtolower Make a string lowercase
  • mb_strtoupper Make a string uppercase
  • mb_strwidth Return width of string
  • mb_substr_count Count the number of substring occurrences
  • mb_substr Get part of string

And iconv provides a bare minimum of string functions:

  • iconv_strlen Returns the character count of string
  • iconv_strpos Finds position of first occurrence of a needle within a haystack
  • iconv_strrpos Finds the last occurrence of a needle within a haystack
  • iconv_substr Cut out part of a string

Lastly, Intl has a lot of extra and powerful Unicode features (but no regular expressions) as part of i18n. Some features overlap with other string functions. With respect to string functions, these are:

  • IntlBreakIterator
  • Grapheme Functions

Substring or charAt method for UTF-8 strings with 2+ byte characters in Java

Does Java have any class in the standard library, or does a class exist somewhere, that provides UTF-8 support?

You're not really after UTF-8 support. You're after Unicode code points (plain 32-bit integers) rather than UTF-16 code units. And yes, Java provides support for this, but it's not hugely easy to work with.

For example, to get a particular code point, use String.codePointAt - bearing in mind that the index you provide is in terms of UTF-16 code units, not code points.

To find the length in code points, use String.codePointCount.

To find a substring, you need to find the offset in terms of UTF-16 code units, then use the normal substring method; use String.offsetByCodePoints to find the right index.

Basically look through the String API at all the methods which contain codePoint.
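A quick tour of those codePoint methods, keeping in mind that every index passed in or returned is in UTF-16 code units:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";                         // 'a', U+1F600, 'b'
        System.out.println(s.length());                      // 4 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        // Code-unit index of the 3rd code point ('b'):
        int i = s.offsetByCodePoints(0, 2);                  // 3
        System.out.println(s.substring(i));                  // b
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1f600
    }
}
```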

How to do substring for UTF8 string in java?

If you want to trim the data in Java, you must write a function that trims the string using the database charset, something like this test case:

package test;

import java.io.UnsupportedEncodingException;

public class TrimField {

    public static void main(String[] args) {
        // UTF-8 is the db charset
        System.out.println(trim("Rückruf ins Ausland", 10, "UTF-8"));
        System.out.println(trim("Rüückruf ins Ausland", 10, "UTF-8"));
    }

    // Drops characters from the end until the encoded form fits in numBytes.
    public static String trim(String value, int numBytes, String charset) {
        do {
            byte[] valueInBytes;
            try {
                valueInBytes = value.getBytes(charset);
            } catch (UnsupportedEncodingException e) {
                throw new RuntimeException(e.getMessage(), e);
            }
            if (valueInBytes.length > numBytes) {
                value = value.substring(0, value.length() - 1);
            } else {
                return value;
            }
        } while (value.length() > 0);
        return "";
    }
}

Java UTF-8 encoding: char and String types

Nothing in your code example is directly using UTF-8. Java strings are encoded in memory using UTF-16 instead. Unicode codepoints that do not fit in a single 16-bit char will be encoded using a 2-char pair known as a surrogate pair.

If you do not pass a parameter value to String.getBytes(), it returns a byte array that has the String contents encoded using the underlying OS's default charset. If you want to ensure a UTF-8 encoded array then you need to use getBytes("UTF-8") instead.

Calling String.charAt() simply returns a UTF-16 code unit from the String's in-memory storage.

So in your example, the Unicode character is stored in the String's in-memory storage using two bytes that are UTF-16 encoded (0x6E 0xFF or 0xFF 0x6E depending on endianness), but is stored in the byte array from getBytes() using three bytes encoded in whatever the OS default charset is.

In UTF-8, that particular Unicode character happens to use 3 bytes as well (0xEF 0xBD 0xAE).
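That can be checked directly in Java; the class name EncodingSizes is made up, and UTF_16BE is used to see the raw big-endian code units without a BOM:

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String s = "\uFF6E";                                  // the character U+FF6E
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE); // big-endian, no BOM
        System.out.println(utf8.length);   // 3
        System.out.println(utf16.length);  // 2
        System.out.printf("%02X %02X %02X%n",
                utf8[0] & 0xFF, utf8[1] & 0xFF, utf8[2] & 0xFF); // EF BD AE
    }
}
```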

How to convert Strings to and from UTF8 byte arrays in Java

Convert from String to byte[]:

String s = "some text here";
byte[] b = s.getBytes(StandardCharsets.UTF_8);

Convert from byte[] to String:

byte[] b = {(byte) 99, (byte)97, (byte)116};
String s = new String(b, StandardCharsets.US_ASCII);

You should, of course, use the correct encoding name. My examples used US-ASCII and UTF-8, two commonly-used encodings.

How to UTF-8 encode a character/string

If you have a wide character string, you can encode it as UTF-8 with the standard wcstombs() function. If you have it in some other encoding (e.g. Latin-1), you will have to decode it to a wide string first.

Edit: ... but wcstombs() depends on your locale settings, and it looks like you can't select a UTF8 locale on Windows. (You don't say what OS you're using.) WideCharToMultiByte() might be more useful on Windows, as you can specify the encoding in the call.


