How to Iterate Through the Unicode Codepoints of a Java String

How can I iterate through the Unicode code points of a Java String?

Yes, Java exposes Strings as sequences of UTF-16 code units, and, yes, it encodes characters outside the Basic Multilingual Plane (BMP) as surrogate pairs.

If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the code points of a Java String:

final int length = s.length();
for (int offset = 0; offset < length; ) {
    final int codepoint = s.codePointAt(offset);

    // do something with the codepoint

    offset += Character.charCount(codepoint);
}
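
On Java 8 and later, the same traversal can also be written with String.codePoints(), which returns an IntStream of code points. The sketch below puts both variants side by side. The class and method names (CodePointLoop, withOffsetLoop, withStream) are made up for illustration, and the sample string is an assumption chosen to contain one character outside the BMP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class CodePointLoop {
    // Collects code points with the offset-based loop shown above.
    static List<Integer> withOffsetLoop(String s) {
        List<Integer> out = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            out.add(codepoint);
            // Advance by 1 for BMP characters, by 2 for surrogate pairs.
            offset += Character.charCount(codepoint);
        }
        return out;
    }

    // Collects code points with the Java 8+ stream API.
    static List<Integer> withStream(String s) {
        return s.codePoints().boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (written as a surrogate pair), 'b'
        String s = "a\uD83D\uDE00b";
        for (int cp : withStream(s)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Both methods should return the same three code points for the sample string, even though s.length() reports four chars.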

Iterate through Unicode characters dynamically

The range U+1200 to U+137F is the Ethiopic block, which covers Amharic among other languages. It lies within the BMP (Basic Multilingual Plane), so every code point in it can be represented by a single 16-bit value.

doing "(char)i" converts it to an ASCII character [???]

False. Unlike in some other languages, a char in Java is a 16-bit (2-byte) UTF-16 code unit rather than an 8-bit ASCII value, so casting any value in this range to char gives you the intended character.

For more information see: Comparing a char to a code-point?
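
As a rough illustration of the cast, the sketch below walks U+1200 to U+137F and turns each value into a character with a plain (char) cast. The class name EthiopicRange and the helper rangeAsString are hypothetical, and whether the output is readable depends on the console encoding and font.

```java
class EthiopicRange {
    // Builds a string from every value in [first, last] using a (char) cast.
    // This is only safe for values up to 0xFFFF; code points above the BMP
    // need Character.toChars or StringBuilder.appendCodePoint instead.
    static String rangeAsString(int first, int last) {
        StringBuilder sb = new StringBuilder();
        for (int i = first; i <= last; i++) {
            sb.append((char) i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rangeAsString(0x1200, 0x137F));
    }
}
```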

What is the easiest/best/most correct way to iterate through the characters of a string in Java?

I use a for loop to iterate the string and use charAt() to get each character to examine it. Since the String is implemented with an array, the charAt() method is a constant time operation.

String s = "...stuff...";

for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    // Process char
}

That's what I would do. It seems the easiest to me.

As far as correctness goes, for text that stays within the BMP I don't believe one approach is more correct than another. It is all based on your personal style.
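
One case where the choice is more than style: charAt() returns UTF-16 code units, so a character outside the BMP shows up as two separate char values. A minimal sketch, with a hypothetical class name and a sample string picked only for illustration:

```java
class CharVsCodePoint {
    // Counts how many chars seen by a charAt() loop are surrogate halves
    // rather than complete characters.
    static int surrogateUnits(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isSurrogate(c)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // 'x' followed by U+1F600, which needs a surrogate pair.
        String s = "x\uD83D\uDE00";
        System.out.println(s.length());                      // 3 chars
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.println(surrogateUnits(s));               // 2 surrogate halves
    }
}
```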

How to iterate over all Unicode characters?

According to the docs, the parameter passed to String.fromCharCode(a) is converted by calling ToUint16, and the corresponding character is then returned. You may call it with any number you want, but the value is reduced modulo 2^16, so the result always lies between 0 and 2^16 - 1 (65535).

const highNumber = 500; // This could go very high
let out = "";
for (let i = 0; i < highNumber; i++) {
    out += String.fromCharCode(i);
}
console.log(out);

Warning: if you run this code with highNumber set to 2^16, you may freeze your tab or browser, because the output is very large. This answer assumes you want to iterate over all characters, not over the characters in a given string, which is quite a different thing.

A sample output for a more reasonable highNumber (i.e. 500) is the following:

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
stuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæç
èéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺ
ĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅƆƇƈƉƊƋƌƍ
ƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿǀǁǂǃǄǅǆǇǈǉǊǋǌǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟǠ
ǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰǱǲǳ
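
For comparison with the Java answers on this page, here is a hedged sketch of the same idea in Java, where StringBuilder.appendCodePoint accepts values above 2^16 - 1 and the upper bound for a full sweep would be Character.MAX_CODE_POINT (0x10FFFF). The class name AllCodePoints and the helper range are hypothetical. Skipping the surrogate range is a choice made here, because those values are not characters on their own.

```java
class AllCodePoints {
    // Concatenates every code point in [first, last], skipping the
    // surrogate range U+D800..U+DFFF.
    static String range(int first, int last) {
        StringBuilder out = new StringBuilder();
        for (int cp = first; cp <= last; cp++) {
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(range(0x21, 0x7E));       // printable ASCII
        System.out.println(range(0x1F600, 0x1F603)); // a few code points beyond the BMP
    }
}
```

As with the JavaScript version, building one string for the whole range up to 0x10FFFF produces a very large result, so a small range is used here.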

Iterating through Unicode codepoints character by character

Use the ICU library.

http://site.icu-project.org/

For example:

http://icu-project.org/apiref/icu4c/classUnicodeString.html#ae3ffb6e15396dff152cb459ce4008f90

is the function that returns the character at a particular character offset in a string.

How to iterate through unicode characters and print them on the screen with printf in C?

If the __STDC_ISO_10646__ macro is defined, wide characters correspond to Unicode codepoints. So, assuming a locale that can represent the characters you are interested in, you can just printf() wide characters via the %lc format conversion:

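#include <stdio.h>
#include <locale.h>
#include <wchar.h>   /* wint_t */

#ifndef __STDC_ISO_10646__
#error "Oops, our wide chars are not Unicode codepoints, sorry!"
#endif

int main(void)
{
    unsigned int i;

    /* Use the locale from the environment so %lc can encode the output. */
    setlocale(LC_ALL, "");

    for (i = 0; i < 0xffff; i++) {
        /* %lc expects a wint_t argument, so cast explicitly. */
        printf("%x - %lc\n", i, (wint_t)i);
    }

    return 0;
}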

