How does BreakIterator work in Android?
BreakIterator can be used to find the possible breaks between characters, words, lines, and sentences. This is useful for things like moving the cursor through visible characters, double clicking to select words, triple clicking to select sentences, and line wrapping.
Boilerplate code
The following code is used in the examples below. Just adjust the first part to change the text and the type of BreakIterator.
// change these two lines for the following examples
String text = "This is some text.";
BreakIterator boundary = BreakIterator.getCharacterInstance();
// boilerplate code
boundary.setText(text);
int start = boundary.first();
for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
    System.out.println(start + " " + text.substring(start, end));
    start = end;
}
If you just want to test this out, you can paste it directly into an Activity's onCreate in Android. I'm using System.out.println rather than Log so that it is also testable in a Java-only environment.
I'm using java.text.BreakIterator rather than the ICU one, which is only available from API 24. See the links at the bottom for more information.
Characters
Change the boilerplate code to include the following:
String text = "Hi 中文éé\uD83D\uDE00\uD83C\uDDEE\uD83C\uDDF3.";
BreakIterator boundary = BreakIterator.getCharacterInstance();
Output
0 H
1 i
2
3 中
4 文
5 é
6 é
8 😀
10 🇮🇳
14 .
The most interesting parts are at indexes 6, 8, and 10. Your browser may or may not display the characters correctly, but a user would interpret all of these as single characters even though they are made up of multiple UTF-16 values.
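To make the gap between UTF-16 length and visible characters concrete, here is a minimal sketch that counts character boundaries instead of calling length(). The class and method names (GraphemeCount, count) are made up for illustration, and how newer emoji sequences such as flags are grouped may vary with the platform and runtime version.

```java
import java.text.BreakIterator;

class GraphemeCount {
    // Counts user-perceived characters by walking the character boundaries.
    static int count(String text) {
        BreakIterator boundary = BreakIterator.getCharacterInstance();
        boundary.setText(text);
        int count = 0;
        boundary.first();
        while (boundary.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' followed by a combining acute accent: two UTF-16 values, one visible character
        String accented = "e\u0301";
        System.out.println(accented.length() + " UTF-16 values, "
                + count(accented) + " visible character(s)");
    }
}
```

The same loop could be used to decide where a cursor may legally stop, since every boundary it visits falls between two visible characters.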
Words
Change the boilerplate code to include the following:
String text = "I like to eat apples. 我喜欢吃苹果。";
BreakIterator boundary = BreakIterator.getWordInstance();
Output
0 I
1
2 like
6
7 to
9
10 eat
13
14 apples
20 .
21
22 我
23 喜欢
25 吃
26 苹果
28 。
There are a few interesting things to note here. First, a word break is detected at both sides of a space. Second, even though there are different languages, multi-character Chinese words were still recognized. This was still true in my tests even when I set the locale to Locale.US.
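Because the word instance also reports spaces and punctuation as their own segments, you usually want to filter them out when you only care about real words. The following is a rough sketch of one way to do that; the names WordExtractor and extractWords are hypothetical, and the letter-or-digit check is a simple heuristic rather than a complete definition of a word.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class WordExtractor {
    // Returns only the segments that start with a letter or digit.
    static List<String> extractWords(String text) {
        BreakIterator boundary = BreakIterator.getWordInstance();
        boundary.setText(text);
        List<String> words = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
            // Skip segments that are whitespace or punctuation
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                words.add(text.substring(start, end));
            }
            start = end;
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("I like to eat apples."));
    }
}
```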
Lines
Keep the same text as in the Words example, but switch to a line instance:
String text = "I like to eat apples. 我喜欢吃苹果。";
BreakIterator boundary = BreakIterator.getLineInstance();
Output
0 I
2 like
7 to
10 eat
14 apples.
22 我
23 喜
24 欢
25 吃
26 苹
27 果。
Note that the break locations are not whole lines of text. They are just convenient places to line wrap text.
The output is similar to the Words example. However, now white space and punctuation are included with the word before them. This makes sense because you wouldn't want a new line to start with white space or punctuation. Also note that Chinese characters get line breaks for every character. This is consistent with the fact that it is ok to break multi-character words across lines in Chinese.
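As an illustration of how these break opportunities might be used, here is a simplified greedy wrapper that cuts a line at the last opportunity that still fits. It is only a sketch: LineWrapper and wrap are invented names, width is measured in UTF-16 values rather than pixels, trailing spaces count toward the width, and a single segment longer than the width is left unbroken.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class LineWrapper {
    // Splits text into lines of at most `width` chars where possible,
    // cutting only at positions reported by the line instance.
    static List<String> wrap(String text, int width) {
        BreakIterator boundary = BreakIterator.getLineInstance();
        boundary.setText(text);
        List<String> lines = new ArrayList<>();
        int lineStart = 0;
        int lastBreak = boundary.first();
        for (int brk = boundary.next(); brk != BreakIterator.DONE; brk = boundary.next()) {
            // Extending to brk would overflow, so end the line at the previous opportunity
            if (brk - lineStart > width && lastBreak > lineStart) {
                lines.add(text.substring(lineStart, lastBreak));
                lineStart = lastBreak;
            }
            lastBreak = brk;
        }
        if (lineStart < text.length()) {
            lines.add(text.substring(lineStart)); // remainder becomes the last line
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : wrap("I like to eat apples.", 10)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

A real text layout would measure rendered width instead of counting chars, but the break positions would come from the iterator in the same way.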
Sentences
Change the boilerplate code to include the following:
String text = "I like to eat apples. My email is me@example.com.\n" +
"This is a new paragraph. 我喜欢吃苹果。我不爱吃臭豆腐。";
BreakIterator boundary = BreakIterator.getSentenceInstance();
Output
0 I like to eat apples.
22 My email is me@example.com.
50 This is a new paragraph.
75 我喜欢吃苹果。
82 我不爱吃臭豆腐。
Correct sentence breaks were recognized in multiple languages. Also, there was no false positive for the dot in the email domain.
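If you want the sentences as a list rather than printed output, the boilerplate loop can be wrapped in a small helper. This is a sketch with hypothetical names (SentenceSplitter, split); note that each segment keeps the white space that follows the sentence, so the example trims it.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class SentenceSplitter {
    // Returns each sentence with surrounding white space trimmed.
    static List<String> split(String text) {
        BreakIterator boundary = BreakIterator.getSentenceInstance();
        boundary.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Hello world. How are you?")) {
            System.out.println(s);
        }
    }
}
```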
Notes
You can set the Locale when you create a BreakIterator, but if you don't it just uses the default locale.
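For reference, passing a locale looks like the sketch below. Each factory method has an overload that takes a Locale; the helper countBoundaries and the class name LocaleExample are made up for this example.

```java
import java.text.BreakIterator;
import java.util.Locale;

class LocaleExample {
    // Counts every boundary the iterator reports, including the one at index 0.
    static int countBoundaries(BreakIterator boundary, String text) {
        boundary.setText(text);
        int count = 0;
        for (int pos = boundary.first(); pos != BreakIterator.DONE; pos = boundary.next()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        BreakIterator defaultLocale = BreakIterator.getWordInstance();      // default locale
        BreakIterator usLocale = BreakIterator.getWordInstance(Locale.US);  // explicit locale
        String text = "Hi there";
        System.out.println(countBoundaries(defaultLocale, text));
        System.out.println(countBoundaries(usLocale, text));
    }
}
```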
Further reading
- Documentation
- ICU version of BreakIterator
- This was one of the more useful tutorials.
Android Studios breakiterator, to break text using only space
I know it's bad, but it works:
// Needs java.util.ArrayList, java.util.List, android.widget.TextView, android.text.Spannable,
// android.text.method.LinkMovementMethod and android.text.style.ClickableSpan.
TextView definitionView = (TextView) findViewById(R.id.et_MainText);
definitionView.setMovementMethod(LinkMovementMethod.getInstance());
definitionView.setText(definition, TextView.BufferType.SPANNABLE);
Spannable spans = (Spannable) definitionView.getText();
String text = definitionView.getText().toString();

// Record the [start, end) offsets of every run of text that is followed by a space.
// Note: text after the last space is not recorded.
List<Integer> wordStarts = new ArrayList<>();
List<Integer> wordEnds = new ArrayList<>();
int wordStart = 0;
for (int i = 0; i < text.length(); i++) {
    if (text.charAt(i) == ' ') { // found a space, so the current word ends here
        wordStarts.add(wordStart);
        wordEnds.add(i);
        if (i + 1 < text.length()) {
            wordStart = i + 1;
        } else {
            break;
        }
    }
}

// Attach a ClickableSpan to each recorded word.
// checkPun, checkSpace and getClickableSpan are helper methods not shown here.
for (int i = 0; i < wordStarts.size(); i++) {
    int start = wordStarts.get(i);
    int end = wordEnds.get(i);
    if (start == end) {
        continue; // consecutive spaces give an empty range, and charAt(0) would throw
    }
    String possibleWord = text.substring(start, end);
    char first = possibleWord.charAt(0);
    if (Character.isLetterOrDigit(first) || checkPun(first) || checkSpace(first)) {
        ClickableSpan clickSpan = getClickableSpan(possibleWord);
        spans.setSpan(clickSpan, start, end, Spannable.SPAN_EXCLUSIVE_EXCLUSIVE);
    }
}
BreakIterator not working correctly with Chinese text
The standard BreakIterator does not support detection of "word" boundaries within unbroken strings of CJK ideographs. There is a bug report on this subject, but it was closed in 2006 as "Won't Fix".
Instead, you'll need to use the ICU implementation. If you're developing on Android, you already have this as android.icu.text.BreakIterator. Otherwise, you'll need to download the ICU4J library from http://site.icu-project.org/download, which has it as com.ibm.icu.text.BreakIterator.
BreakIterator.preceding fails in Android V2?
Below is validateOffset in the Gingerbread code:
private void validateOffset(int offset) {
    CharacterIterator it = wrapped.getText();
    if (offset < it.getBeginIndex() || offset >= it.getEndIndex()) {
        throw new IllegalArgumentException();
    }
}
and in the ICS code it is as below:
private void validateOffset(int offset) {
    CharacterIterator it = wrapped.getText();
    if (offset < it.getBeginIndex() || offset > it.getEndIndex()) {
        String message = "Valid range is [" + it.getBeginIndex() + " " + it.getEndIndex() + "]";
        throw new IllegalArgumentException(message);
    }
}
The >= has been changed to >. The end offset check seems to be wrong on 2.X devices: an offset equal to the end index is rejected. That is exactly your case, where the offset you are passing to preceding coincides with the end index of the string. This seems to be a bug in the framework.
You can find the source in AOSP code at libcore/luni/src/main/java/java/text/RuleBasedBreakIterator.java.
Here's the Gingerbread code and here's the ICS code
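One possible workaround, assuming the off-by-one check at the end index is the only problem, is to avoid calling preceding with the end offset and step back from last() instead. The helper below is a sketch with invented names (SafePreceding, precedingCompat); I have not run it on a Gingerbread device, and it expects setText to have been called with the same text already.

```java
import java.text.BreakIterator;

class SafePreceding {
    // Behaves like boundary.preceding(offset), but handles offset == text.length()
    // without passing the end index to preceding().
    static int precedingCompat(BreakIterator boundary, String text, int offset) {
        if (offset == text.length()) {
            boundary.last();            // move to the final boundary
            return boundary.previous(); // the boundary before it, or DONE if there is none
        }
        return boundary.preceding(offset);
    }

    public static void main(String[] args) {
        String text = "Hello world";
        BreakIterator boundary = BreakIterator.getWordInstance();
        boundary.setText(text);
        System.out.println(precedingCompat(boundary, text, text.length()));
    }
}
```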