PatternTokenizer?

Hello,

I have encountered the following ElasticSearch "puzzle":

  1. I would like my analyzer to use the ASCII Folding token filter (to
    make search work properly in Polish).

  2. I need fine-grained control over how tokens are split. Therefore
    I use PatternAnalyzer.

I cannot combine these two, because (correct me if I am wrong) the
only analyzer that allows customization of filters is CustomAnalyzer,
and I cannot add the ASCII Folding filter to PatternAnalyzer.

I guess that having a PatternTokenizer (a tokenizer, not an analyzer)
would solve my problem. Actually, there is a (private) class
PatternAnalyzer.PatternTokenizer in Lucene.

Would it make sense to add a PatternTokenizer to ElasticSearch? Or is
there any other way to solve this issue?
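
To make the goal concrete, here is a minimal plain-Java sketch of the behaviour I am after: split the text with a regular expression first, then fold each token to ASCII. It does not use any Lucene or ElasticSearch API; the class and method names (PatternFoldingSketch, tokenize, fold, analyze) are hypothetical, and fold() is only a simplified stand-in for the real ASCII Folding filter.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: a regex "tokenizer" stage followed by a
// simplified ASCII-folding "filter" stage. Not a Lucene/ElasticSearch API.
public class PatternFoldingSketch {

    // Tokenizer stage: split on the separator pattern, dropping empty tokens.
    static List<String> tokenize(String text, Pattern separator) {
        List<String> tokens = new ArrayList<>();
        for (String part : separator.split(text)) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Filter stage: decompose accented letters (NFD) and drop combining marks.
    // Polish l-with-stroke has no canonical decomposition, so it is mapped
    // explicitly. A real ASCII folding filter covers far more characters.
    static String fold(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.replace('\u0142', 'l').replace('\u0141', 'L');
    }

    // Chain the two stages, the way a custom analyzer would.
    static List<String> analyze(String text, String separatorRegex) {
        Pattern separator = Pattern.compile(separatorRegex);
        List<String> result = new ArrayList<>();
        for (String token : tokenize(text, separator)) {
            result.add(fold(token));
        }
        return result;
    }

    public static void main(String[] args) {
        // Polish sample text, written with Unicode escapes to avoid
        // source-encoding problems.
        String text = "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144";
        System.out.println(analyze(text, "\\s+")); // prints [Zazolc, gesla, jazn]
    }
}
```

The point of the sketch is only the ordering: the split pattern stays under my control, and folding is applied per token afterwards.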

Regards,
-Pawel Wrzeszcz

On Fri, Feb 18, 2011 at 5:47 AM, Pawel Wrzeszcz
pawel.wrzeszcz@gmail.com wrote:

> I guess that having a PatternTokenizer (a tokenizer, not an analyzer)
> would solve my problem. Actually, there is a (private) class
> PatternAnalyzer.PatternTokenizer in Lucene.
>
> Would it make sense to add a PatternTokenizer to ElasticSearch? Or is
> there any other way to solve this issue?

Hi, in Lucene's trunk there are some cleaner pattern-based components
that replaced this PatternAnalyzer: a PatternTokenizer, a
PatternReplaceFilter, and a PatternReplaceCharFilter. These used to be
in Solr, but in Lucene's trunk all analysis components have been merged
into a single module.

http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/pattern/

(Note: for performance reasons, some of these use new features of the
upcoming Lucene 3.1 analysis API, but they may still be an easier
place to start from.)