What does the rbbi rule for icu_tokenizer in the docs do?

(Judy Raj) #1

I was trying to work with the icu tokenizer and came across the rbbi rules section in the docs.
I used the same rule from the docs, .+ {200};, for Arabic, but my result looks nothing like the one in the docs.
Could someone explain what .+ {200}; means?

(Jörg Prante) #2

Unicode has a concept of scripts, e.g. "Latn", "Cyrl", "Arab" etc. See ISO 15924 and http://www.unicode.org/reports/tr24/

The ICU tokenizer can be configured to use a custom break iterator for word segmentation. One type of break iterator is the rule-based break iterator (RBBI). A set of rules typically applies to a single script, so an RBBI rule file is specified together with its script code, like "Latn:name.rbbi".

More information about ICU rules for finding boundary positions within text can be found at http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules

So, as a very basic example, the file KeywordTokenizer.rbbi consists of the single rule ".+ {200};". When it is loaded for Latin script (e.g. as "Latn:KeywordTokenizer.rbbi"), it means: "for text in Latin script, take one character, no matter which character, repeat for all characters in the input until none are left, and map the result to token class 200 (which stands for "alphanumerical tokens")".

This rule effectively ignores word boundaries and takes the whole input as a single token (which is equivalent to the Lucene KeywordTokenizer).
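
Since a rule file is bound to a script code, the same rule would presumably have to be registered under "Arab" rather than "Latn" before it affects Arabic text. Here is a minimal sketch of index settings for that. It assumes a file named KeywordTokenizer.rbbi, containing only the rule above, sits in the Elasticsearch config directory. The tokenizer and analyzer names are made up for illustration.

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_arabic_keyword": {
            "type": "icu_tokenizer",
            "rule_files": "Arab:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "arabic_keyword_analyzer": {
            "type": "custom",
            "tokenizer": "icu_arabic_keyword"
          }
        }
      }
    }
  }
}
```

rule_files takes a comma-separated list of script:file pairs, so "Latn:KeywordTokenizer.rbbi,Arab:KeywordTokenizer.rbbi" should cover both scripts with the same rule.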
