StringSearch is a SearchIterator that provides language-sensitive text
searching based on the comparison rules defined in a RuleBasedCollator
object.
StringSearch handles language-specific peculiarities; e.g. with the German
collator, the characters ß and SS match if case is chosen to be ignored.
See the "ICU Collation Design Document" for more information.
There are 2 match options for selection:
Let S' be the sub-string of a text string S between the offsets start and
end [start, end].
A pattern string P matches a text string S at the offsets [start, end]
if
option 1. Some canonical equivalent of P matches some canonical equivalent
of S'.
option 2. P matches S' and, if P starts or ends with a combining mark,
there exists no non-ignorable combining mark before or after S'
in S respectively.
Option 2 is the default.
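To make "canonical equivalent" in option 1 concrete, here is a minimal sketch. It uses the JDK's java.text.Normalizer rather than any ICU class, and the class and method names (CanonicalMatchSketch, canonicallyEquivalent) are hypothetical. It only shows when two strings are canonically equivalent; it does not reproduce how StringSearch performs the match.

```java
import java.text.Normalizer;

final class CanonicalMatchSketch {
    /** True if both strings have the same canonical decomposition (NFD). */
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9";   // e with acute, one code point
        String decomposed = "e\u0301";   // e followed by combining acute
        // The code units differ, but the strings are canonically equivalent.
        System.out.println(precomposed.equals(decomposed));                // false
        System.out.println(canonicallyEquivalent(precomposed, decomposed)); // true
    }
}
```

Under option 1, a pattern given in one of these forms would be expected to match text stored in the other; under option 2 the combining-mark condition stated above applies instead.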
This search has APIs similar to those of other text iteration mechanisms
such as the break iterators in BreakIterator. Using these
APIs, it is easy to scan through text looking for all occurrences of
a given pattern. This search iterator allows the direction to be changed by
calling #reset followed by #next or #previous.
Though a direction change can occur without calling #reset first,
this operation comes with some speed penalty.
Matches found in the forward direction are the same as those found in
the backward direction, in reverse order.
SearchIterator provides APIs to specify the starting position
within the text string to be searched, e.g.
SearchIterator#setIndex,
SearchIterator#preceding and
SearchIterator#following.
Since the starting position is set exactly as specified, note that
searching from the following positions may produce incorrect
results:
- In the midst of a substring that requires normalization.
- If the following match is to be found, the position should not be the
second character which requires swapping with the preceding
character. Vice versa, if the preceding match is to be found, the
position to search from should not be the first character which
requires swapping with the next character. E.g. certain Thai and
Lao characters require swapping.
- If a following pattern match is to be found, any position within a
contracting sequence except the first will fail. Vice versa if a
preceding pattern match is to be found, an invalid starting point
would be any character within a contracting sequence except the last.
A BreakIterator can be used if only matches at logical breaks are desired.
Using a BreakIterator will only give you results that exactly match the
boundaries given by the BreakIterator. For instance, the pattern "e" will
not be found in the string "\u00e9" if a character break iterator is used.
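The boundary restriction can be pictured as a filter on candidate matches. The sketch below is an illustration under assumptions, not the StringSearch implementation: it uses the JDK's java.text.BreakIterator as a stand-in, a decomposed text "e\u0301x" chosen for the example, and hypothetical names (BoundaryFilterSketch, alignsWithBoundaries). A candidate match is kept only if both of its ends fall on boundaries reported by the break iterator.

```java
import java.text.BreakIterator;

final class BoundaryFilterSketch {
    /** True if both start and end (exclusive) are boundaries according to the iterator. */
    static boolean alignsWithBoundaries(String text, int start, int end, BreakIterator breaks) {
        breaks.setText(text);
        return breaks.isBoundary(start) && breaks.isBoundary(end);
    }

    public static void main(String[] args) {
        String text = "e\u0301x"; // 'e' + combining acute accent, then 'x'
        BreakIterator chars = BreakIterator.getCharacterInstance();
        // A candidate match covering only the bare 'e' ends inside the
        // accented character, so it does not align with character boundaries.
        System.out.println(alignsWithBoundaries(text, 0, 1, chars));
        // The whole accented character does align.
        System.out.println(alignsWithBoundaries(text, 0, 2, chars));
    }
}
```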
Options are provided to handle overlapping matches.
E.g. in English, overlapping matches produce the results 0 and 2
for the pattern "abab" in the text "ababab", whereas mutually
exclusive matches produce only the result 0.
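The difference between the two modes comes down to where the scan resumes after a match. The following sketch reproduces the offsets from the example above using plain String.indexOf instead of collation-based matching; OverlapSketch and findStarts are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapSketch {
    /** Returns the start offsets of all matches of pattern in text. */
    static List<Integer> findStarts(String text, String pattern, boolean overlapping) {
        List<Integer> starts = new ArrayList<>();
        if (pattern.isEmpty()) {
            return starts; // nothing sensible to report for an empty pattern
        }
        int from = 0;
        while (true) {
            int pos = text.indexOf(pattern, from);
            if (pos < 0) {
                break;
            }
            starts.add(pos);
            // Overlapping: resume one position after the match start.
            // Mutually exclusive: resume after the end of the match.
            from = overlapping ? pos + 1 : pos + pattern.length();
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findStarts("ababab", "abab", true));  // [0, 2]
        System.out.println(findStarts("ababab", "abab", false)); // [0]
    }
}
```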
Options are also provided to implement "asymmetric search" as described in
UTS #10 Unicode Collation Algorithm, specifically the ElementComparisonType
values.
Though collator attributes are taken into consideration while
performing matches, there are no APIs here for setting and getting the
attributes. These attributes can be set by getting the collator
from #getCollator and using the APIs in RuleBasedCollator.
Finally, for StringSearch to pick up the new collator attributes,
#reset has to be called.
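As a rough illustration of why collator attributes matter for matching, the sketch below runs the same pattern against the same text at two collation strengths. It is written under stated assumptions: it uses the JDK's java.text.Collator as a stand-in for the ICU RuleBasedCollator, a deliberately naive quadratic substring scan, and hypothetical names (NaiveCollatorSearch, findAll). It is not how StringSearch is implemented.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class NaiveCollatorSearch {
    /**
     * Returns {start, end} pairs (end exclusive) of substrings that the
     * collator considers equal to the pattern. For each start offset only
     * the shortest matching substring is reported.
     */
    static List<int[]> findAll(String text, String pattern, Collator collator) {
        List<int[]> matches = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                if (collator.compare(pattern, text.substring(start, end)) == 0) {
                    matches.add(new int[] {start, end});
                    break;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);

        collator.setStrength(Collator.PRIMARY);  // case differences are ignored
        System.out.println(findAll("aFUSSbFussc", "fuss", collator).size());

        collator.setStrength(Collator.TERTIARY); // case differences are significant
        System.out.println(findAll("aFUSSbFussc", "fuss", collator).size());
    }
}
```

With StringSearch itself, the equivalent change would be made on the collator returned by #getCollator, followed by a call to #reset as described above.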
Restriction:
Currently there are no composite characters that consist of a
character with combining class > 0 before a character with combining
class == 0. However, if such a character exists in the future,
StringSearch does not guarantee the results for option 1.
Consult the
SearchIterator documentation for information on
and examples of how to use instances of this class to implement text
searching.
Note, StringSearch is not to be subclassed.