public abstract class BreakIterator
extends Object implements Cloneable

java.lang.Object
↳ android.icu.text.BreakIterator
[icu enhancement] ICU's replacement for java.text.BreakIterator. Methods, fields, and other functionality specific to ICU are labeled '[icu]'.
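A class that locates boundaries in text. This class defines a protocol for objects that break up a piece of natural-language text according to a set of criteria. Instances or subclasses of BreakIterator can be provided, for example, to break a piece of text into words, sentences, or logical characters according to the conventions of some language or group of languages. We provide five built-in types of BreakIterator, each obtained from a static factory method:
- getTitleInstance() [icu] returns a BreakIterator that locates title boundaries.
- getSentenceInstance() returns a BreakIterator that locates sentence boundaries.
- getWordInstance() returns a BreakIterator that locates word boundaries.
- getLineInstance() returns a BreakIterator that locates legal line-wrapping positions.
- getCharacterInstance() returns a BreakIterator that locates logical-character boundaries.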
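BreakIterator's interface follows an "iterator" model (hence the name), meaning it has a concept of a "current position" and methods like first(), last(), next(), and previous() that update the current position. All BreakIterators uphold the following invariants:
- The beginning and end of the text are always treated as boundary positions.
- The current position of the iterator is always a boundary position (random-access methods move the iterator to the nearest boundary position before or after the specified position, not to the specified position).
- DONE is used as a flag to indicate when iteration has stopped. DONE is only returned when the current position is the end of the text and the user calls next(), or when the current position is the beginning of the text and the user calls previous().
- Break positions are numbered by the positions of the characters that follow them. Thus the position before the first character is 0 and the position after the last character is the length of the text.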
Examples:
Creating and using text boundaries

    public static void main(String args[]) {
        if (args.length == 1) {
            String stringToExamine = args[0];
            //print each word in order
            BreakIterator boundary = BreakIterator.getWordInstance();
            boundary.setText(stringToExamine);
            printEachForward(boundary, stringToExamine);
            //print each sentence in reverse order
            boundary = BreakIterator.getSentenceInstance(Locale.US);
            boundary.setText(stringToExamine);
            printEachBackward(boundary, stringToExamine);
            printFirst(boundary, stringToExamine);
            printLast(boundary, stringToExamine);
        }
    }

Print each element in order

    public static void printEachForward(BreakIterator boundary, String source) {
        int start = boundary.first();
        for (int end = boundary.next();
             end != BreakIterator.DONE;
             start = end, end = boundary.next()) {
            System.out.println(source.substring(start,end));
        }
    }

Print each element in reverse order

    public static void printEachBackward(BreakIterator boundary, String source) {
        int end = boundary.last();
        for (int start = boundary.previous();
             start != BreakIterator.DONE;
             end = start, start = boundary.previous()) {
            System.out.println(source.substring(start,end));
        }
    }

Print first element

    public static void printFirst(BreakIterator boundary, String source) {
        int start = boundary.first();
        int end = boundary.next();
        System.out.println(source.substring(start,end));
    }

Print last element

    public static void printLast(BreakIterator boundary, String source) {
        int end = boundary.last();
        int start = boundary.previous();
        System.out.println(source.substring(start,end));
    }

Print the element at a specified position

    public static void printAt(BreakIterator boundary, int pos, String source) {
        int end = boundary.following(pos);
        int start = boundary.previous();
        System.out.println(source.substring(start,end));
    }

Find the next word

    public static int nextWordStartAfter(int pos, String text) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        int last = wb.following(pos);
        int current = wb.next();
        while (current != BreakIterator.DONE) {
            for (int p = last; p < current; p++) {
                if (Character.isLetter(text.charAt(p)))
                    return last;
            }
            last = current;
            current = wb.next();
        }
        return BreakIterator.DONE;
    }

(The iterator returned by BreakIterator.getWordInstance() is unique in that the break positions it returns don't represent both the start and end of the thing being iterated over. That is, a sentence-break iterator returns breaks that each represent the end of one sentence and the beginning of the next. With the word-break iterator, the characters between two boundaries might be a word, or they might be the punctuation or whitespace between two words. The above code uses a simple heuristic to determine which boundary is the beginning of a word: if the characters between this boundary and the next boundary include at least one letter (this can be an alphabetical letter, a CJK ideograph, a Hangul syllable, a Kana character, etc.), then the text between this boundary and the next is a word; otherwise, it's the material between words.)
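The letter heuristic described above can also be used to collect every word in a string. The following is a minimal sketch, not part of the original reference: the class and method names are hypothetical, and it imports java.text.BreakIterator as a stand-in so it runs on a plain JVM (on Android, importing android.icu.text.BreakIterator instead should work for the calls used here).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: keeps only the segments that contain at least one letter.
class WordExtractor {
    public static List<String> words(String text) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
        wb.setText(text);
        List<String> out = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            if (containsLetter(text, start, end)) {
                out.add(text.substring(start, end));
            }
        }
        return out;
    }

    // True if any char in [start, end) is a letter (same heuristic as nextWordStartAfter).
    private static boolean containsLetter(String text, int start, int end) {
        for (int p = start; p < end; p++) {
            if (Character.isLetter(text.charAt(p))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! 42"));
    }
}
```

Because the heuristic tests for letters only, a purely numeric segment such as "42" is treated as material between words.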
Constants | | |
---|---|---|
int | DONE | DONE is returned by previous() and next() after all valid boundaries have been returned. |
int | KIND_CHARACTER | [icu] |
int | KIND_LINE | [icu] |
int | KIND_SENTENCE | [icu] |
int | KIND_TITLE | [icu] |
int | KIND_WORD | [icu] |
int | WORD_IDEO | Tag value for words containing ideographic characters, lower limit. |
int | WORD_IDEO_LIMIT | Tag value for words containing ideographic characters, upper limit. |
int | WORD_KANA | Tag value for words containing kana characters, lower limit. |
int | WORD_KANA_LIMIT | Tag value for words containing kana characters, upper limit. |
int | WORD_LETTER | Tag value for words that contain letters, excluding hiragana, katakana or ideographic characters, lower limit. |
int | WORD_LETTER_LIMIT | Tag value for words containing letters, upper limit. |
int | WORD_NONE | Tag value for "words" that do not fit into any of the other categories. |
int | WORD_NONE_LIMIT | Upper bound for tags for uncategorized words. |
int | WORD_NUMBER | Tag value for words that appear to be numbers, lower limit. |
int | WORD_NUMBER_LIMIT | Tag value for words that appear to be numbers, upper limit. |
Protected constructors | |
---|---|
BreakIterator() | Default constructor. |
Public methods | | |
---|---|---|
Object | clone() | Clone method. |
abstract int | current() | Return the iterator's current position. |
abstract int | first() | Set the iterator to the first boundary position. |
abstract int | following(int offset) | Sets the iterator's current iteration position to be the first boundary position following the specified position. |
static Locale[] | getAvailableLocales() | Returns a list of locales for which BreakIterators can be used. |
static BreakIterator | getCharacterInstance(ULocale where) | [icu] Returns a new instance of BreakIterator that locates logical-character boundaries. |
static BreakIterator | getCharacterInstance(Locale where) | Returns a new instance of BreakIterator that locates logical-character boundaries. |
static BreakIterator | getCharacterInstance() | Returns a new instance of BreakIterator that locates logical-character boundaries. |
static BreakIterator | getLineInstance(Locale where) | Returns a new instance of BreakIterator that locates legal line-wrapping positions. |
static BreakIterator | getLineInstance(ULocale where) | [icu] Returns a new instance of BreakIterator that locates legal line-wrapping positions. |
static BreakIterator | getLineInstance() | Returns a new instance of BreakIterator that locates legal line-wrapping positions. |
int | getRuleStatus() | For RuleBasedBreakIterators, return the status tag from the break rule that determined the most recently returned break position. |
int | getRuleStatusVec(int[] fillInArray) | For RuleBasedBreakIterators, get the status (tag) values from the break rule(s) that determined the most recently returned break position. |
static BreakIterator | getSentenceInstance(Locale where) | Returns a new instance of BreakIterator that locates sentence boundaries. |
static BreakIterator | getSentenceInstance(ULocale where) | [icu] Returns a new instance of BreakIterator that locates sentence boundaries. |
static BreakIterator | getSentenceInstance() | Returns a new instance of BreakIterator that locates sentence boundaries. |
abstract CharacterIterator | getText() | Returns a CharacterIterator over the text being analyzed. |
static BreakIterator | getTitleInstance(Locale where) | [icu] Returns a new instance of BreakIterator that locates title boundaries. |
static BreakIterator | getTitleInstance(ULocale where) | [icu] Returns a new instance of BreakIterator that locates title boundaries. |
static BreakIterator | getTitleInstance() | [icu] Returns a new instance of BreakIterator that locates title boundaries. |
static BreakIterator | getWordInstance(Locale where) | Returns a new instance of BreakIterator that locates word boundaries. |
static BreakIterator | getWordInstance(ULocale where) | [icu] Returns a new instance of BreakIterator that locates word boundaries. |
static BreakIterator | getWordInstance() | Returns a new instance of BreakIterator that locates word boundaries. |
boolean | isBoundary(int offset) | Return true if the specified position is a boundary position. |
abstract int | last() | Set the iterator to the last boundary position. |
abstract int | next() | Advances the iterator forward one boundary. |
abstract int | next(int n) | Move the iterator by the specified number of steps in the text. |
int | preceding(int offset) | Sets the iterator's current iteration position to be the last boundary position preceding the specified position. |
abstract int | previous() | Move the iterator backward one boundary. |
void | setText(String newText) | Sets the iterator to analyze a new piece of text. |
abstract void | setText(CharacterIterator newText) | Sets the iterator to analyze a new piece of text. |
Inherited methods | |
---|---|
From class java.lang.Object | |
int DONE
DONE is returned by previous() and next() after all valid boundaries have been returned.
Constant Value: -1 (0xffffffff)
int WORD_IDEO
Tag value for words containing ideographic characters, lower limit
Constant Value: 400 (0x00000190)
int WORD_IDEO_LIMIT
Tag value for words containing ideographic characters, upper limit
Constant Value: 500 (0x000001f4)
int WORD_KANA
Tag value for words containing kana characters, lower limit
Constant Value: 300 (0x0000012c)
int WORD_KANA_LIMIT
Tag value for words containing kana characters, upper limit
Constant Value: 400 (0x00000190)
int WORD_LETTER
Tag value for words that contain letters, excluding hiragana, katakana or ideographic characters, lower limit.
Constant Value: 200 (0x000000c8)
int WORD_LETTER_LIMIT
Tag value for words containing letters, upper limit
Constant Value: 300 (0x0000012c)
int WORD_NONE
Tag value for "words" that do not fit into any of other categories. Includes spaces and most punctuation.
Constant Value: 0 (0x00000000)
int WORD_NONE_LIMIT
Upper bound for tags for uncategorized words.
Constant Value: 100 (0x00000064)
int WORD_NUMBER
Tag value for words that appear to be numbers, lower limit.
Constant Value: 100 (0x00000064)
int WORD_NUMBER_LIMIT
Tag value for words that appear to be numbers, upper limit.
Constant Value: 200 (0x000000c8)
BreakIterator ()
Default constructor. There is no state that is carried by this abstract base class.
Object clone ()
Clone method. Creates another BreakIterator with the same behavior and current state as this one.
Returns | |
---|---|
Object | The clone. |
int current ()
Return the iterator's current position.
Returns | |
---|---|
int | The iterator's current position. |
int first ()
Set the iterator to the first boundary position. This is always the beginning index of the text this iterator iterates over. For example, if the iterator iterates over a whole string, this function will always return 0.
Returns | |
---|---|
int | The character offset of the beginning of the stretch of text being broken. |
int following (int offset)
Sets the iterator's current iteration position to be the first boundary position following the specified position. Whether the specified position is itself a boundary position does not matter: this function always moves the iteration position to the first boundary after the specified position. If the specified position is the past-the-end position, returns DONE.
Parameters | |
---|---|
offset | int: The character position to start searching from. |
Returns | |
---|---|
int | The position of the first boundary position following "offset" (whether or not "offset" itself is a boundary position), or DONE if "offset" is the past-the-end offset. |
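As a rough illustration of following(), the sketch below asks for the first boundary strictly after an offset, whether or not that offset is itself a boundary. The class and method names are hypothetical, and java.text.BreakIterator is assumed as a stand-in so the example runs on a plain JVM.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper wrapping following() on a word-break iterator.
class FollowingDemo {
    public static int boundaryAfter(String text, int offset) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
        wb.setText(text);
        // Always moves past 'offset', even when 'offset' is already a boundary.
        return wb.following(offset);
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(boundaryAfter(text, 2)); // offset inside "Hello"
        System.out.println(boundaryAfter(text, 5)); // offset that is itself a boundary
    }
}
```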
Locale[] getAvailableLocales ()
Returns a list of locales for which BreakIterators can be used.
Returns | |
---|---|
Locale[] | An array of Locales. All of the locales in the array can be used when creating a BreakIterator. |
BreakIterator getCharacterInstance (ULocale where)
[icu] Returns a new instance of BreakIterator that locates logical-character boundaries.
Parameters | |
---|---|
where | ULocale: A Locale specifying the language of the text being analyzed. |
Returns | |
---|---|
BreakIterator | A new instance of BreakIterator that locates logical-character boundaries. |
Throws | |
---|---|
NullPointerException | if where is null. |
BreakIterator getCharacterInstance (Locale where)
Returns a new instance of BreakIterator that locates logical-character boundaries.
Parameters | |
---|---|
where | Locale: A Locale specifying the language of the text being analyzed. |
Returns | |
---|---|
BreakIterator | A new instance of BreakIterator that locates logical-character boundaries. |
Throws | |
---|---|
NullPointerException | if where is null. |
BreakIterator getCharacterInstance ()
Returns a new instance of BreakIterator that locates logical-character boundaries. This function assumes that the text being analyzed is in the default locale's language.
Returns | |
---|---|
BreakIterator | A new instance of BreakIterator that locates logical-character boundaries. |
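A common use of a character-break iterator is counting user-perceived ("logical") characters instead of UTF-16 code units. The sketch below is an illustration only: the names are hypothetical, and java.text.BreakIterator is assumed as a stand-in so it runs on a plain JVM.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: counts logical characters by counting boundary steps.
class GraphemeCounter {
    public static int count(String text) {
        BreakIterator cb = BreakIterator.getCharacterInstance(Locale.ROOT);
        cb.setText(text);
        int n = 0;
        cb.first();
        // Each successful next() crosses exactly one logical character.
        while (cb.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String s = "e\u0301a"; // 'e' + combining acute accent, then 'a'
        System.out.println(s.length() + " chars, " + count(s) + " logical characters");
    }
}
```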
BreakIterator getLineInstance (Locale where)
Returns a new instance of BreakIterator that locates legal line-wrapping positions.
Parameters | |
---|---|
where | Locale: A Locale specifying the language of the text being broken. |
Returns | |
---|---|
BreakIterator | A new instance of BreakIterator that locates legal line-wrapping positions. |
Throws | |
---|---|
NullPointerException | if where is null. |
BreakIterator getLineInstance (ULocale where)
[icu] Returns a new instance of BreakIterator that locates legal line-wrapping positions.
Parameters | |
---|---|
where | ULocale: A Locale specifying the language of the text being broken. |
Returns | |
---|---|
BreakIterator | A new instance of BreakIterator that locates legal line-wrapping positions. |
Throws | |
---|---|
NullPointerException | if where is null. |
BreakIterator getLineInstance ()
Returns a new instance of BreakIterator that locates legal line-wrapping positions. This function assumes the text being broken is in the default locale's language.
Returns | |
---|---|
BreakIterator | A new instance of BreakIterator that locates legal line-wrapping positions. |
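To show how line-wrapping positions might be consumed, here is a small greedy wrapper that breaks only at positions reported by a line-break iterator. It is a sketch under assumptions: the class name, method name, and the fixed character-count width are hypothetical, and java.text.BreakIterator stands in so the code runs on a plain JVM.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical greedy wrapper: starts a new line when the next legal break
// would exceed 'width' chars (trailing whitespace counts toward the width).
class LineWrapper {
    public static List<String> wrap(String text, int width) {
        BreakIterator lb = BreakIterator.getLineInstance(Locale.ROOT);
        lb.setText(text);
        List<String> lines = new ArrayList<>();
        int lineStart = lb.first();
        int lastBreak = lineStart;
        for (int b = lb.next(); b != BreakIterator.DONE; b = lb.next()) {
            if (b - lineStart > width && lastBreak > lineStart) {
                lines.add(text.substring(lineStart, lastBreak).trim());
                lineStart = lastBreak;
            }
            lastBreak = b;
        }
        if (lineStart < text.length()) {
            lines.add(text.substring(lineStart).trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : wrap("The quick brown fox", 10)) {
            System.out.println(line);
        }
    }
}
```

A segment longer than the width with no legal break inside it is emitted on its own line unbroken, which is a deliberate simplification.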
int getRuleStatus ()
For RuleBasedBreakIterators, return the status tag from the break rule that determined the most recently returned break position.
For break iterator types that do not support a rule status, a default value of 0 is returned.
Returns | |
---|---|
int | The status from the break rule that determined the most recently returned break position. |
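For a word-break iterator, the status tag can be mapped to a category using the WORD_* ranges listed under Constants. The sketch below is self-contained and copies those constant values locally so it runs without ICU; the class name, the describe() method, and the treatment of each range as lower-inclusive and upper-exclusive are assumptions (the latter suggested by adjacent ranges sharing a value, e.g. WORD_NONE_LIMIT and WORD_NUMBER are both 100).

```java
// Hypothetical classifier for word-break rule status tags.
class RuleStatusTags {
    // Values copied from the constants documented for BreakIterator.
    static final int WORD_NONE = 0, WORD_NONE_LIMIT = 100;
    static final int WORD_NUMBER = 100, WORD_NUMBER_LIMIT = 200;
    static final int WORD_LETTER = 200, WORD_LETTER_LIMIT = 300;
    static final int WORD_KANA = 300, WORD_KANA_LIMIT = 400;
    static final int WORD_IDEO = 400, WORD_IDEO_LIMIT = 500;

    // Assumption: each range is [lower, upper).
    public static String describe(int tag) {
        if (tag >= WORD_NONE && tag < WORD_NONE_LIMIT) return "none";
        if (tag >= WORD_NUMBER && tag < WORD_NUMBER_LIMIT) return "number";
        if (tag >= WORD_LETTER && tag < WORD_LETTER_LIMIT) return "letter";
        if (tag >= WORD_KANA && tag < WORD_KANA_LIMIT) return "kana";
        if (tag >= WORD_IDEO && tag < WORD_IDEO_LIMIT) return "ideographic";
        return "unknown";
    }

    public static void main(String[] args) {
        // With ICU available, the argument would come from wordIterator.getRuleStatus().
        System.out.println(describe(0));
        System.out.println(describe(200));
    }
}
```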
int getRuleStatusVec (int[] fillInArray)
For RuleBasedBreakIterators, get the status (tag) values from the break rule(s) that determined the most recently returned break position.
For break iterator types that do not support rule status, no values are returned.
If the size of the output array is insufficient to hold the data, the output will be truncated to the available length. No exception will be thrown.
Parameters | |
---|---|
fillInArray | int[]: An array to be filled in with the status values. |
Returns | |
---|---|
int | The number of rule status values from rules that determined the most recent boundary returned by the break iterator. In the event that the array is too small, the return value is the total number of status values that were available, not the reduced number that were actually returned. |
BreakIterator getSentenceInstance (Locale where)
Returns a new instance of BreakIterator that locates sentence boundaries.
Parameters | |
---|---|
where | Locale: A Locale specifying the language of the text being analyzed. |
Returns | |
---|---|
BreakIterator | A new instance of BreakIterator that locates sentence boundaries. |
Throws | |
---|---|
NullPointerException | if where is null. |
BreakIterator getSentenceInstance (ULocale where)
[icu] Returns a new instance of BreakIterator that locates sentence boundaries.
Parameters | |
---|---|
where | ULocale: A Locale specifying the language of the text being analyzed. |
Returns | |
---|---|
BreakIterator | A new instance of BreakIterator that locates sentence boundaries. |
Throws | |
---|---|
NullPointerException | if where is null. |
BreakIterator getSentenceInstance ()
Returns a new instance of BreakIterator that locates sentence boundaries. This function assumes the text being analyzed is in the default locale's language.
Returns | |
---|---|
BreakIterator | A new instance of BreakIterator that locates sentence boundaries. |
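Since each sentence boundary marks both the end of one sentence and the start of the next, consecutive boundaries can be paired directly into sentences. The following sketch illustrates this; the names are hypothetical, trimming the trailing whitespace is a choice made here, and java.text.BreakIterator is assumed as a stand-in so it runs on a plain JVM.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits text into trimmed sentences.
class SentenceSplitter {
    public static List<String> sentences(String text, Locale locale) {
        BreakIterator sb = BreakIterator.getSentenceInstance(locale);
        sb.setText(text);
        List<String> out = new ArrayList<>();
        int start = sb.first();
        for (int end = sb.next(); end != BreakIterator.DONE; start = end, end = sb.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Hello there. How are you?", Locale.US));
    }
}
```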
CharacterIterator getText ()
Returns a CharacterIterator over the text being analyzed. For at least some subclasses of BreakIterator, this is a reference to the actual iterator being used by the BreakIterator, and therefore this function's return value should be treated as read-only. No guarantees are made about the current position of this iterator when it is returned. If you need to move that position to examine the text, clone this function's return value first.
Returns | |
---|---|
CharacterIterator | A CharacterIterator over the text being analyzed. |
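The advice to clone before moving the returned iterator can be sketched as follows. The helper reads the analyzed text back through a clone, leaving the BreakIterator's own position untouched. Names are hypothetical, and java.text.BreakIterator is assumed as a stand-in so the example runs on a plain JVM.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

// Hypothetical helper: copies out the analyzed text without disturbing the iterator.
class TextSnapshot {
    public static String textOf(BreakIterator bi) {
        // Clone first: the returned iterator may be the one the BreakIterator uses internally.
        CharacterIterator copy = (CharacterIterator) bi.getText().clone();
        StringBuilder sb = new StringBuilder();
        for (char c = copy.first(); c != CharacterIterator.DONE; c = copy.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
        wb.setText("abc def");
        wb.next();
        System.out.println(textOf(wb) + " / current=" + wb.current());
    }
}
```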
BreakIterator getTitleInstance (Locale where)
[icu] Returns a new instance of BreakIterator that locates title boundaries.
The iterator returned locates title boundaries as described for Unicode 3.2 only. For Unicode 4.0 and above title boundary iteration, please use a word boundary iterator; see getWordInstance().
Parameters | |
---|---|
where | Locale: A Locale specifying the language of the text being analyzed. |
Returns | |
---|---|
BreakIterator | A new instance of BreakIterator that locates title boundaries. |
Throws | |
---|---|
NullPointerException | if where is null. |
BreakIterator getTitleInstance (ULocale where)
[icu] Returns a new instance of BreakIterator that locates title boundaries.
The iterator returned locates title boundaries as described for Unicode 3.2 only. For Unicode 4.0 and above title boundary iteration, please use a word boundary iterator; see getWordInstance().
Parameters | |
---|---|
where | ULocale: A Locale specifying the language of the text being analyzed. |
Returns | |
---|---|
BreakIterator | A new instance of BreakIterator that locates title boundaries. |
Throws | |
---|---|
NullPointerException | if where is null. |
BreakIterator getTitleInstance ()
[icu] Returns a new instance of BreakIterator that locates title boundaries.
This function assumes the text being analyzed is in the default locale's language. The iterator returned locates title boundaries as described for Unicode 3.2 only. For Unicode 4.0 and above title boundary iteration, please use a word boundary iterator; see getWordInstance().
Returns | |
---|---|
BreakIterator | A new instance of BreakIterator that locates title boundaries. |
BreakIterator getWordInstance (Locale where)
Returns a new instance of BreakIterator that locates word boundaries.
Parameters | |
---|---|
where | Locale: A locale specifying the language of the text to be analyzed. |
Returns | |
---|---|
BreakIterator | An instance of BreakIterator that locates word boundaries. |
Throws | |
---|---|
NullPointerException | if where is null. |
BreakIterator getWordInstance (ULocale where)
[icu] Returns a new instance of BreakIterator that locates word boundaries.
Parameters | |
---|---|
where | ULocale: A locale specifying the language of the text to be analyzed. |
Returns | |
---|---|
BreakIterator | An instance of BreakIterator that locates word boundaries. |
Throws | |
---|---|
NullPointerException | if where is null. |
BreakIterator getWordInstance ()
Returns a new instance of BreakIterator that locates word boundaries. This function assumes that the text being analyzed is in the default locale's language.
Returns | |
---|---|
BreakIterator | An instance of BreakIterator that locates word boundaries. |
boolean isBoundary (int offset)
Return true if the specified position is a boundary position. If the function returns true, the current iteration position is set to the specified position; if the function returns false, the current iteration position is set as though following() had been called.
Parameters | |
---|---|
offset | int: The offset to check. |
Returns | |
---|---|
boolean | True if "offset" is a boundary position. |
int last ()
Set the iterator to the last boundary position. This is always the "past-the-end" index of the text this iterator iterates over. For example, if the iterator iterates over a whole string (call it "text"), this function will always return text.length().
Returns | |
---|---|
int | The character offset of the end of the stretch of text being broken. |
int next ()
Advances the iterator forward one boundary. The current iteration position is updated to point to the next boundary position after the current position, and this is also the value that is returned. If the current position is equal to the value returned by last(), or to DONE, this function returns DONE and sets the current position to DONE.
Returns | |
---|---|
int | The position of the first boundary position following the iteration position. |
int next (int n)
Move the iterator by the specified number of steps in the text. A positive number moves the iterator forward; a negative number moves the iterator backwards. If this causes the iterator to move off either end of the text, this function returns DONE; otherwise, this function returns the position of the appropriate boundary. Calling this function is equivalent to calling next() or previous() n times.
Parameters | |
---|---|
n | int: The number of boundaries to advance over (if positive, moves forward; if negative, moves backwards). |
Returns | |
---|---|
int | The position of the boundary n boundaries from the current iteration position, or DONE if moving n boundaries causes the iterator to advance off either end of the text. |
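The stepping behavior of next(int n) can be illustrated by jumping n boundaries from the start of the text. This is a sketch only: the class and method names are hypothetical, and java.text.BreakIterator is assumed as a stand-in so the example runs on a plain JVM.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: position of the n-th word boundary after the start of the text.
class BoundaryStepper {
    public static int nthBoundary(String text, int n) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
        wb.setText(text);
        wb.first();        // start from the first boundary
        return wb.next(n); // DONE if this steps off the end of the text
    }

    public static void main(String[] args) {
        String text = "Hello brave world";
        System.out.println(nthBoundary(text, 2));
        System.out.println(nthBoundary(text, 10) == BreakIterator.DONE);
    }
}
```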
int preceding (int offset)
Sets the iterator's current iteration position to be the last boundary position preceding the specified position. Whether the specified position is itself a boundary position does not matter: this function always moves the iteration position to the last boundary before the specified position. If the specified position is the starting position, returns DONE.
Parameters | |
---|---|
offset | int: The character position to start searching from. |
Returns | |
---|---|
int | The position of the last boundary position preceding "offset" (whether or not "offset" itself is a boundary position), or DONE if "offset" is the starting offset of the iterator. |
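One practical use of preceding(), combined with isBoundary(), is cutting text at a word boundary without exceeding a length limit. The sketch below is illustrative: the names are hypothetical, maxLen is assumed to be non-negative, and java.text.BreakIterator is used as a stand-in so it runs on a plain JVM.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: truncates at the last word boundary at or before maxLen.
class WordTruncator {
    public static String truncate(String text, int maxLen) {
        if (text.length() <= maxLen) {
            return text;
        }
        BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
        wb.setText(text);
        // Keep maxLen if it is already a boundary; otherwise back up to the one before it.
        int cut = wb.isBoundary(maxLen) ? maxLen : wb.preceding(maxLen);
        return text.substring(0, cut == BreakIterator.DONE ? 0 : cut);
    }

    public static void main(String[] args) {
        System.out.println("[" + truncate("Hello brave world", 8) + "]");
    }
}
```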
int previous ()
Move the iterator backward one boundary. The current iteration position is updated to point to the last boundary position before the current position, and this is also the value that is returned. If the current position is equal to the value returned by first(), or to DONE, this function returns DONE and sets the current position to DONE.
Returns | |
---|---|
int | The position of the last boundary position preceding the iteration position. |
void setText (String newText)
Sets the iterator to analyze a new piece of text. The new piece of text is passed in as a String, and the current iteration position is reset to the beginning of the string. (The old text is dropped.)
Parameters | |
---|---|
newText | String: A String containing the text to analyze with this BreakIterator. |
void setText (CharacterIterator newText)
Sets the iterator to analyze a new piece of text. The BreakIterator is passed a CharacterIterator through which it will access the text itself. The current iteration position is reset to the CharacterIterator's start index. (The old iterator is dropped.)
Parameters | |
---|---|
newText | CharacterIterator: A CharacterIterator referring to the text to analyze with this BreakIterator (the iterator's current position is ignored, but its other state is significant). |