Accessing chars in utf-8 strings
So would the only way to randomly access that character be a linear search? (Start from the beginning and, for each character, check its length, skipping over that many bytes until I reach the correct character index.)
Yes, exactly.
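That linear scan can be sketched in Java, assuming well-formed UTF-8 input; the class and method names here are mine, not from the original discussion:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Scan {
    // Linear scan: return the byte offset of the n-th codepoint (0-based)
    // in UTF-8 data by reading each lead byte and skipping that character's
    // full length, exactly as described above.
    static int offsetOfCodepoint(byte[] utf8, int n) {
        int offset = 0;
        for (int i = 0; i < n; i++) {
            int lead = utf8[offset] & 0xFF;
            if (lead < 0x80)      offset += 1; // 0xxxxxxx: 1-byte char
            else if (lead < 0xE0) offset += 2; // 110xxxxx: 2-byte char
            else if (lead < 0xF0) offset += 3; // 1110xxxx: 3-byte char
            else                  offset += 4; // 11110xxx: 4-byte char
        }
        return offset;
    }

    public static void main(String[] args) {
        byte[] data = "aüあ🙂x".getBytes(StandardCharsets.UTF_8);
        // codepoint index 4 ('x') sits after 1 + 2 + 3 + 4 = 10 bytes
        System.out.println(offsetOfCodepoint(data, 4)); // 10
    }
}
```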
If so, why does everyone want to store files in utf-8?
UTF-8 is more portable than UTF-16 or UTF-32 (UTF-8 has no endian issues), and is backwards compatible with ASCII, so it won't break the large majority of legacy apps. Also, UTF-8 is more compact in byte size than UTF-16 for Unicode codepoints U+0000 - U+007F, and is the same byte size as UTF-16 for codepoints U+0080 - U+07FF. So UTF-8 tends to be a better choice for handling the majority of the world's commonly used English/Latin-based languages. However, once you start dealing with Unicode codepoints above U+07FF (Asian languages, symbols, emojis, etc.), UTF-16 usually becomes more compact than UTF-8.
UTF-16 tends to be easier to work with when processing data, since it only deals with 1 codeunit for codepoints U+0000 - U+FFFF, compared to UTF-8's use of 1-3 codeunits for the same codepoints. UTF-16 uses 2 codeunits for the remaining codepoints, compared to UTF-8's use of 4 codeunits for the same codepoints.
But even then, UTF-16 is technically a variable-length encoding, so you can't really use random access with it, either. True random access is possible in UTF-8 only if the data contains codepoints U+0000 - U+007F and nothing higher, and is possible in UTF-16 only if the data contains codepoints U+0000 - U+FFFF and nothing higher. Anything else requires linear scanning. However, scanning through UTF-16 is easier than scanning through UTF-8 since fewer codeunits are involved. And UTF-16 is designed to easily detect leading and trailing codeunits to skip them during scanning, whereas UTF-8 does not lend itself as well to that.
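A sketch of that kind of UTF-16 scan in Java (whose strings are UTF-16), using the standard Character.isHighSurrogate/isLowSurrogate tests to detect and skip trailing codeunits; the helper name is mine:

```java
public class Utf16Scan {
    // Count codepoints in a Java String (a sequence of UTF-16 code units)
    // by hand, skipping the trailing (low) surrogate of each pair.
    static int countCodepoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            count++;
            if (Character.isHighSurrogate(s.charAt(i))
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                i++; // skip the trailing unit of the surrogate pair
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countCodepoints("héllo")); // 5: all in the BMP
        System.out.println(countCodepoints("a🙂b"));  // 3: 🙂 is a surrogate pair
    }
}
```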
Wouldn't that just make parsing and analysis hugely more expensive?
UTF-8 is better suited for storage and communications, but not necessarily easier for parsing. It depends on the languages involved. UTF-16 tends to be better suited for parsing, as long as you account for surrogate pairs.
If you don't want to handle variable-length characters, and need true random access, then use UTF-32 instead, since it uses only 1 codeunit for every possible codepoint.
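One way to see this in Java: decoding a string once into an int[] of codepoints gives you the fixed-width, randomly accessible representation UTF-32 provides, one codeunit per codepoint. A minimal sketch, not from the original answer:

```java
public class Utf32Access {
    public static void main(String[] args) {
        String s = "a🙂b"; // 4 UTF-16 code units, 3 codepoints
        // Decode once into an array of codepoints (effectively UTF-32):
        int[] cps = s.codePoints().toArray();
        System.out.println(cps.length); // 3: true O(1) random access by index
        // cps[1] is the 🙂 codepoint; it needs 2 UTF-16 units when re-encoded:
        System.out.println(Character.toChars(cps[1]).length); // 2
        System.out.println(new String(Character.toChars(cps[1]))); // 🙂
    }
}
```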
In context: I'm writing a language lexer, and information all around says source files should be in UTF-8. But if I support variable-length characters, wouldn't that just complicate everything unnecessarily?
Not necessarily, especially if you are only supporting forward parsing. Even with UTF-16, you have to account for variable-length characters as well.
Would it be acceptable to just support UTF-8/ASCII with only single-byte characters in source files?
That depends on the requirements of your parser, but I would say no. Many users want to be able to embed Unicode data in their source files, and even use Unicode identifiers if possible. Even back in the ANSI days before Unicode, non-ASCII characters could be either single-byte or multi-byte depending on the charset used.
So unless you want to completely shun non-ASCII languages (which is not a good idea in today's international world), you should deal with variable-length characters in one form or another.
PHP string functions: which ones will work with UTF-8?
Core PHP SBCS string functions
Assuming the default encoding of PHP is set to UTF-8, these string functions will work:
- echo: Output one or more strings
- html_entity_decode: Convert all HTML entities to their applicable characters
- htmlentities: Convert all applicable characters to HTML entities | better use htmlspecialchars
- htmlspecialchars_decode: Convert special HTML entities back to characters
- htmlspecialchars: Convert special characters to HTML entities
- implode: Join array elements with a string
- join: Alias of implode
- nl2br: Inserts HTML line breaks before all newlines in a string
- print: Output a string
- quotemeta: Quote meta characters
- str_repeat: Repeat a string
- str_rot13: Perform the rot13 transform on a string
- strip_tags: Strip HTML and PHP tags from a string
- stripcslashes: Un-quote string quoted with addcslashes
- stripslashes: Un-quotes a quoted string
Unfortunately all other string functions do not work with UTF-8.
Obstacles:
- case handling and whitespace handling do not work with UTF-8
- string lengths in parameters and return values are byte lengths, not character lengths
- string processing causes data corruption
- the string function is completely ASCII-oriented

In some cases functions can work as expected when parameters are US-ASCII and lengths are byte lengths.
Binary string functions are still useful:
- bin2hex: Convert binary data into hexadecimal representation
- chr: Return a specific character (=byte)
- convert_uudecode: Decode a uuencoded string
- convert_uuencode: Uuencode a string
- crc32: Calculates the crc32 polynomial of a string
- crypt: One-way string hashing
- hex2bin: Decodes a hexadecimally encoded binary string
- md5_file: Calculates the md5 hash of a given file
- md5: Calculate the md5 hash of a string
- ord: Return ASCII value of character (=byte)
- sha1_file: Calculate the sha1 hash of a file
- sha1: Calculate the sha1 hash of a string
Configuration functions do not apply:
- get_html_translation_table: Returns the translation table used by htmlspecialchars and htmlentities
- localeconv: Get numeric formatting information
- nl_langinfo: Query language and locale information
- setlocale: Set locale information
Regular expression functions and encoding and transcoding functions are not considered.
Extensions
In quite a few cases, Multibyte String offers a UTF-8 variant:
- mb_convert_case: Perform case folding on a string
- mb_parse_str: Parse GET/POST/COOKIE data and set global variable
- mb_split: Split multibyte string using regular expression
- mb_strcut: Get part of string
- mb_strimwidth: Get truncated string with specified width
- mb_stripos: Finds position of first occurrence of a string within another, case insensitive
- mb_stristr: Finds first occurrence of a string within another, case insensitive
- mb_strlen: Get string length
- mb_strpos: Find position of first occurrence of string in a string
- mb_strrchr: Finds the last occurrence of a character in a string within another
- mb_strrichr: Finds the last occurrence of a character in a string within another, case insensitive
- mb_strripos: Finds position of last occurrence of a string within another, case insensitive
- mb_strrpos: Find position of last occurrence of a string in a string
- mb_strstr: Finds first occurrence of a string within another
- mb_strtolower: Make a string lowercase
- mb_strtoupper: Make a string uppercase
- mb_strwidth: Return width of string
- mb_substr_count: Count the number of substring occurrences
- mb_substr: Get part of string
And iconv provides a bare minimum of string functions:
- iconv_strlen: Returns the character count of string
- iconv_strpos: Finds position of first occurrence of a needle within a haystack
- iconv_strrpos: Finds the last occurrence of a needle within a haystack
- iconv_substr: Cut out part of a string
Lastly, Intl has a lot of extra and powerful Unicode features (but no regular expressions) as part of i18n. Some features overlap with other string functions. With respect to string functions, these are:
- IntlBreakIterators
- Grapheme Functions
Substring or characterAt method for UTF-8 strings with 2+ bytes in Java
Does Java have any class in the default library, or does a class exist somewhere, that provides UTF-8 support?
You're not after UTF-8 support, really. You're after Unicode code points (plain 32-bit integers) rather than UTF-16 code units. And yes, Java provides support for this, but it's not hugely easy to work with.
For example, to get a particular code point, use String.codePointAt - bearing in mind that the index you provide is in terms of UTF-16 code units, not code points. To find the length in code points, use String.codePointCount. To find a substring, you need to find the offset in terms of UTF-16 code units, then use the normal substring method; use String.offsetByCodePoints to find the right index. Basically, look through the String API at all the methods which contain codePoint.
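A small sketch pulling those methods together; the codePointSubstring helper is my own name, not a standard API:

```java
public class CodePointDemo {
    // Substring by codepoint indices: map codepoint indices to UTF-16
    // code-unit indices with offsetByCodePoints, then use normal substring.
    static String codePointSubstring(String s, int cpStart, int cpEnd) {
        int start = s.offsetByCodePoints(0, cpStart);
        int end = s.offsetByCodePoints(start, cpEnd - cpStart);
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String s = "a🙂b🙂c"; // 5 codepoints, 7 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 5
        System.out.println(codePointSubstring(s, 1, 4));     // 🙂b🙂
    }
}
```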
How to do substring for UTF8 string in java?
If you want to trim the data in Java, you must write a function that trims the string using the db charset, something like this test case:

package test;

import java.io.UnsupportedEncodingException;

public class TrimField {

    public static void main(String[] args) {
        // UTF-8 is the db charset
        System.out.println(trim("Rückruf ins Ausland", 10, "UTF-8"));
        System.out.println(trim("Rüückruf ins Ausland", 10, "UTF-8"));
    }

    public static String trim(String value, int numBytes, String charset) {
        do {
            byte[] valueInBytes;
            try {
                valueInBytes = value.getBytes(charset);
            } catch (UnsupportedEncodingException e) {
                throw new RuntimeException(e.getMessage(), e);
            }
            if (valueInBytes.length > numBytes) {
                // Drop the last char until the encoded value fits in numBytes.
                value = value.substring(0, value.length() - 1);
            } else {
                return value;
            }
        } while (value.length() > 0);
        return "";
    }
}
java utf8 encoding - char, string types
Nothing in your code example is directly using UTF-8. Java strings are encoded in memory using UTF-16 instead. Unicode codepoints that do not fit in a single 16-bit char will be encoded using a 2-char pair known as a surrogate pair.
If you do not pass a parameter value to String.getBytes(), it returns a byte array with the String contents encoded using the underlying OS's default charset. If you want to ensure a UTF-8 encoded array, then you need to use getBytes("UTF-8") instead.
Calling String.charAt() returns an original UTF-16 encoded char from the String's in-memory storage only.
So in your example, the Unicode character ョ is stored in the String's in-memory storage using two bytes that are UTF-16 encoded (0x6E 0xFF or 0xFF 0x6E depending on endianness), but is stored in the byte array from getBytes() using three bytes that are encoded using whatever the OS default charset is.
In UTF-8, that particular Unicode character happens to use 3 bytes as well (0xEF 0xBD 0xAE).
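A short check of those sizes, using explicit charsets so the result does not depend on the OS default; a sketch, not from the original answer:

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String s = "\uFF6E"; // ョ, HALFWIDTH KATAKANA LETTER SMALL YO
        // One UTF-16 code unit (two bytes) inside the String itself:
        System.out.println(s.length());                                   // 1
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 2
        // Three bytes when encoded as UTF-8 (0xEF 0xBD 0xAE):
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 3
    }
}
```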
How to convert Strings to and from UTF8 byte arrays in Java
Convert from String to byte[]:

String s = "some text here";
byte[] b = s.getBytes(StandardCharsets.UTF_8);

Convert from byte[] to String:

byte[] b = {(byte) 99, (byte) 97, (byte) 116};
String s = new String(b, StandardCharsets.US_ASCII);
You should, of course, use the correct encoding name. My examples used US-ASCII and UTF-8, two commonly-used encodings.
How to UTF-8 encode a character/string
If you have a wide character string, you can encode it in UTF-8 with the standard wcstombs() function. If you have it in some other encoding (e.g. Latin-1), you will have to decode it to a wide string first.
Edit: ... but wcstombs() depends on your locale settings, and it looks like you can't select a UTF-8 locale on Windows. (You don't say what OS you're using.) WideCharToMultiByte() might be more useful on Windows, as you can specify the encoding in the call.