Custom Tokenization


(Michael Bleigh) #1

I'm a bit new to ElasticSearch so my apologies if this is an obvious thing.
I'm trying to build an index that will match tag-like structures in a
consistent fashion. It's important not to do fuzzy matching (it needs to be
exact) but I also need to support a variety of representations.

What I'm wondering is how I would be able to create an analyzer such that
these phrases (as an example) would all match identically when queried:

"Heart and Soul", "heart_and_soul", "Heart & Soul", "heart-and-soul"

However, I would not want "heart" on its own to match; that is the kind of
specificity I mentioned.

What is the best way to achieve this? It seems like creating a
pattern_replace filter could potentially help me, but I can't seem to find
the documentation for that filter anywhere. Help would be much appreciated.

I'm using Ruby and Tire, so bonus points if the explanation is in Ruby, but
that's not required.

Thanks!


(Karussell) #2

If you cannot find the docs, go to https://github.com/elasticsearch/elasticsearch/
and type 't' followed by your filter name, 'patternreplace'.

This gives you:

https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/apache/lucene/analysis/pattern/PatternReplaceFilter.java

https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/analysis/PatternReplaceTokenFilterFactory.java

So you'll need 'pattern' and 'replacement', and you can also use 'flags' and
'all'.
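
As a rough sketch of how those parameters might fit together for the tags in
the question: one option is a custom analyzer that keeps the whole string as a
single token (keyword tokenizer), lowercases it, and then runs two
pattern_replace filters. The names "tag_analyzer", "ampersand_to_and" and
"separators_to_underscore" are made up for this example, and the
normalize_tag helper is only a plain-Ruby imitation of the filter chain for
checking the regexes locally. It is not part of ElasticSearch or Tire, and the
call that actually creates the index is left out.

```ruby
require 'json'

# Index settings as a plain Ruby hash. The analyzer and filter names are
# hypothetical; 'pattern' and 'replacement' are the filter parameters
# mentioned above.
TAG_ANALYSIS = {
  analysis: {
    filter: {
      # "Heart & Soul" -> "heart and soul"
      ampersand_to_and: {
        type: 'pattern_replace',
        pattern: '\s*&\s*',
        replacement: ' and '
      },
      # Collapse spaces, hyphens, underscores, etc. into one underscore.
      separators_to_underscore: {
        type: 'pattern_replace',
        pattern: '[^a-z0-9]+',
        replacement: '_'
      }
    },
    analyzer: {
      tag_analyzer: {
        type: 'custom',
        # keyword tokenizer: the whole tag stays one token, so "heart"
        # alone cannot match "heart and soul".
        tokenizer: 'keyword',
        filter: %w[lowercase ampersand_to_and separators_to_underscore]
      }
    }
  }
}.freeze

# Plain-Ruby imitation of the filter chain, for trying out the regexes
# without a running cluster. Assumption: it mirrors the order of the
# filters listed in tag_analyzer.
def normalize_tag(text)
  text.downcase
      .gsub(/\s*&\s*/, ' and ')
      .gsub(/[^a-z0-9]+/, '_')
end

puts JSON.pretty_generate(TAG_ANALYSIS)
['Heart and Soul', 'heart_and_soul', 'Heart & Soul', 'heart-and-soul', 'heart'].each do |tag|
  puts "#{tag.inspect} -> #{normalize_tag(tag).inspect}"
end
```

The same analyzer would have to be applied to the tag field at index time and
to the query text at search time, otherwise the two sides are normalized
differently and will not match.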

Peter.
