Using PatternTokenizer


(ppearcy) #1

Hello,
Is it correct that in order to use the PatternTokenizer, one would
need to implement a plugin similar to icu?

Thanks,
Paul


(Shay Banon) #2

Yes, but it can be part of the built-in analyzers in elasticsearch (I assume
you refer to the one in Lucene).

-shay.banon



(Shay Banon) #3

Added this as an issue: http://github.com/elasticsearch/elasticsearch/issues/issue/276



(ppearcy) #4

Yeah, it probably makes sense to have it built in. I'd be happy to
create a fork and submit it. I would plan on exposing the pattern,
lowercase, and stopwords options, which map directly to the inputs of
Lucene's PatternAnalyzer.
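
For readers unfamiliar with those three options, here is a minimal sketch of what they do, written with plain `java.util.regex` rather than Lucene. The class `PatternAnalysisSketch` and its `analyze` method are hypothetical names made up for illustration; this is not Lucene's PatternAnalyzer, and the real class may differ in details such as how empty tokens are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration only: this is NOT Lucene's PatternAnalyzer.
final class PatternAnalysisSketch {

    // Splits text on the delimiter pattern, optionally lowercases each
    // token, and drops any token found in the stopword set.
    static List<String> analyze(String text, Pattern pattern,
                                boolean lowercase, Set<String> stopwords) {
        List<String> tokens = new ArrayList<>();
        for (String token : pattern.split(text)) {
            if (token.isEmpty()) {
                continue; // a leading delimiter yields an empty first piece
            }
            String term = lowercase ? token.toLowerCase(Locale.ROOT) : token;
            // Stopwords are compared after lowercasing, so the set is
            // expected to hold lowercase entries when lowercase is true.
            if (!stopwords.contains(term)) {
                tokens.add(term);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern nonWord = Pattern.compile("\\W+");
        // Prints [quick, brown, fox]
        System.out.println(
            analyze("The quick, brown FOX", nonWord, true, Set.of("the")));
    }
}
```

With `lowercase` set to false, the token "The" would not match the stopword "the" and would be kept, which is one reason the three options are usually configured together.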

A separate pattern tokenizer would be nice to combine with other
options, but that doesn't appear to exist in Lucene (though Solr has a
more flexible version based on regex grouping that will probably be
available with the Lucene/Solr merge). It would not be hard to
write; I just don't need it for my use case.
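
To make the "regex grouping" idea concrete, here is a rough sketch of a standalone tokenizer that either splits on the pattern or emits a chosen capture group from each match. It uses only `java.util.regex`; the class `GroupTokenizerSketch`, the `tokenize` method, and the convention that a negative `group` means "split" are assumptions of this sketch, not a copy of the Solr or Lucene code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not the Solr or Lucene implementation.
final class GroupTokenizerSketch {

    // group < 0:  treat the pattern as a delimiter and split on it.
    // group >= 0: emit that capture group from every match (0 = whole match).
    static List<String> tokenize(String text, Pattern pattern, int group) {
        List<String> tokens = new ArrayList<>();
        if (group < 0) {
            for (String piece : pattern.split(text)) {
                if (!piece.isEmpty()) {
                    tokens.add(piece);
                }
            }
            return tokens;
        }
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            // Throws IndexOutOfBoundsException if the pattern has no such group.
            String value = matcher.group(group);
            if (value != null && !value.isEmpty()) {
                tokens.add(value);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Split mode: prints [a, b, c]
        System.out.println(tokenize("a,b;c", Pattern.compile("[,;]"), -1));
        // Group mode: prints [aaa, bbb]
        System.out.println(tokenize("'aaa', 'bbb'", Pattern.compile("'([^']+)'"), 1));
    }
}
```

Because the output is just a list of tokens, a tokenizer like this could be chained with separate lowercase or stopword filters, which is the flexibility a combined analyzer does not offer.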

Thanks,
Paul



(ppearcy) #5

Huh, somehow Nabble (which shows your response referencing
http://github.com/elasticsearch/elasticsearch/issues/issue/276) and
Google Groups (which doesn't) are out of sync.

Anyway, thanks a ton! It seems straightforward, and I'll let you know if
there are any issues.

Best Regards,
Paul


