I totally missed this message the first time around.
I think in general the confusion is about "collation" versus
"everything else"? The rest of the stuff in the icu package is
"normal" analysis components but collation is weird to think about
because you are building binary sort keys instead.
On Wed, Nov 30, 2011 at 12:06 PM, Clinton Gormley clint@traveljury.com wrote:
- should the tokenizer always be 'keyword' when using the ICU plugin,
as per the examples given on the page the above link points to?
It depends what you are doing. If you are using Collation, definitely,
because you are building a sort key.
- When would you use normalization with nfkc_cf and when would you use
the case-folding filter?
When you say case-folding filter, maybe you are referring to ICUFoldingFilter?
This is not really a case-folding filter: it does case-fold, but it
also does other stuff (and nfkc_cf case-folds too).
It is nfkc_cf PLUS additional stuff.
There is a long list of what this additional stuff actually is here:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUFoldingFilter.java,
but in general it's stuff like "removing accents".
In general, you would use nfkc_cf as a replacement for LowerCaseFilter
in your application. In most applications that will be displaying text
to the user, nfkc_cf would be considered pretty aggressive: for
example it reduces full-width numbers to ascii numbers, and things
like that. But for search, this kind of stuff isn't aggressive, it's
actually pretty conservative.
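
To make the "full-width numbers to ascii numbers" point concrete, here is a rough sketch in Python using only the standard library. It is not ICU and not the Lucene filter: the helper name approx_nfkc_cf is made up for illustration, and it only approximates nfkc_cf by combining NFKC normalization with Unicode case folding, so the real thing may differ in details (for example, how ignorable code points are treated).

```python
import unicodedata


def approx_nfkc_cf(text: str) -> str:
    """Hypothetical helper: approximate nfkc_cf with the standard library."""
    # NFKC maps compatibility characters (e.g. full-width forms) to plain ones.
    compat = unicodedata.normalize("NFKC", text)
    # casefold() applies Unicode full case folding; normalize again afterwards
    # because folding can produce text that is no longer in NFKC form.
    return unicodedata.normalize("NFKC", compat.casefold())


# Full-width "Full123" collapses to plain lowercase ASCII.
print(approx_nfkc_cf("\uff26\uff55\uff4c\uff4c\uff11\uff12\uff13"))  # full123
# Case folding is more than lowercasing: sharp s expands to "ss".
print(approx_nfkc_cf("Stra\u00dfe"))  # strasse
# Accents are kept at this level of normalization.
print(approx_nfkc_cf("Caf\u00e9"))  # café
```

Note that the accent on "café" survives, which is why this level is the conservative choice for search.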
If you felt like it wasn't doing enough (e.g. you want to remove
accents too), then you could change to the ICUFoldingFilter for even more
aggressive normalization. Think of the ICUFoldingFilter as a
replacement for both LowerCaseFilter and ASCIIFoldingFilter.
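
As a rough illustration of "nfkc_cf PLUS accent removal", here is a standard-library Python sketch. The helper name approx_icu_folding is hypothetical, and this only covers case folding and stripping combining marks; the real ICUFoldingFilter does considerably more (see the long list linked above), so treat it as an approximation of the idea rather than of the filter.

```python
import unicodedata


def approx_icu_folding(text: str) -> str:
    """Hypothetical helper: case-fold, then strip accents (a small subset of
    what ICUFoldingFilter does)."""
    # Step 1: approximate nfkc_cf (compatibility normalization + case folding).
    compat = unicodedata.normalize("NFKC", text)
    folded = unicodedata.normalize("NFKC", compat.casefold())
    # Step 2: decompose so that accents become separate combining marks.
    decomposed = unicodedata.normalize("NFD", folded)
    # Step 3: drop nonspacing marks (category Mn), i.e. the accents.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


print(approx_icu_folding("Caf\u00e9"))          # cafe
print(approx_icu_folding("\u00c5ngstr\u00f6m"))  # angstrom
```

With this kind of folding a query for "cafe" matches "Café", which is the behavior you would otherwise get from LowerCaseFilter plus ASCIIFoldingFilter.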
- Given that sorting can only be done on fields with single terms,
presumably it'd only be useful to use the collation filter by
itself, with the 'keyword' tokenizer. Or would it be feasible
to combine with the case folding filter?
You could case-fold before collation, but there is no need to do this.
It's also dangerous (depending upon what your collation rules are),
because some locales have special case rules (but nfkc_cf is
"generic").
Instead, set properties like strength on the collator to determine if
it should build case-insensitive sort keys (and let the collator
handle it totally).
- How would you combine these filters with language stemmers?
Which filters? I don't think stemming makes sense for sort keys, so I
don't see it combined with collation.
For the tokenizer, normalization filter, etc., just swap them in for
the existing components in your chain if you want to try them out:
instead of StandardTokenizer + LowerCaseFilter + EnglishStemmer, you
could use ICUTokenizer + ICUNormalizer2Filter + EnglishStemmer.
- How would the collations work if each document can have its own
language?
That sounds more like an application problem... The question is
really, how do you want the sort to work? Whose rules do you want it
to use?
If you absolutely MUST sort across different languages and just want
the best sort that makes the fewest people angry, then use the root
locale (this is the empty string).
...plus other stuff I haven't thought of
If you were able to produce an example, it'd be greatly appreciated.
For collation or for using all the icu filters in general?
For collation we have these examples:
http://lucene.apache.org/java/3_5_0/api/all/org/apache/lucene/collation/package-summary.html
http://wiki.apache.org/solr/UnicodeCollation
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/analysis-extras/src/test/org/apache/solr/analysis/TestICUCollationKeyFilterFactory.java
The ICU documentation (the ICU User Guide) is really good, and they
have a cool demo that lets you investigate sort rules for different
locales and different options visually online. Go to
http://demo.icu-project.org/icu-bin/locexp, pick a locale and hit the
demo button under Collation rules.
For the tokenizer, it's just a drop-in replacement for
StandardTokenizer. The Normalizer2Filter with nfkc_cf can be seen as a
replacement for LowerCaseFilter. The FoldingFilter can be seen as a
replacement for ASCIIFoldingFilter, etc.