Handling hyphenated words like "e-mail" with tokenizer/stemmer

(David Steiner) #1

Ideally, I would like to stem "e-mail" to "email" so that a search in the stemmed field finds it whichever way someone typed it. Currently, the standard tokenizer appears to split "e-mail" into "e" and "mail", which isn't good: I can no longer stem "e-mail", and "mail" and "e-mail" can no longer be told apart. I can't use "whitespace" as a tokenizer either, because if the text I'm indexing contains other punctuation, such as "He said, 'send me an e-mail!'", "send" isn't indexed correctly; it's indexed as "'send".
So, I guess I'd like to tokenize in the standard way, except that hyphens are retained so that I have a chance to stem hyphenated words. Is there a way to do this? If it involves the "pattern" tokenizer, I'm hoping there's already a known pattern for this.
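
To make the desired behaviour concrete, here is a rough sketch in plain Python rather than Elasticsearch itself. The regular expression `[^\w-]+` and the helper names `tokenize_keep_hyphens` and `join_hyphens` are assumptions made up for this illustration; they only mimic splitting on everything except word characters and hyphens, then collapsing the hyphen as a stand-in for the stemming step.

```python
import re

# Assumed split pattern: any run of characters that are neither
# word characters nor hyphens acts as a token separator.
TOKEN_SPLIT = re.compile(r"[^\w-]+")


def tokenize_keep_hyphens(text):
    """Lowercase the text and split it, keeping hyphens inside tokens."""
    pieces = TOKEN_SPLIT.split(text.lower())
    # Trim stray hyphens at token edges and drop empty pieces.
    trimmed = (piece.strip("-") for piece in pieces)
    return [token for token in trimmed if token]


def join_hyphens(token):
    """Hypothetical stand-in for the stemming step: 'e-mail' -> 'email'."""
    return token.replace("-", "")


text = "He said, 'send me an e-mail!'"
tokens = tokenize_keep_hyphens(text)
normalized = [join_hyphens(token) for token in tokens]

print(tokens)      # "send" has no leading quote, "e-mail" stays whole
print(normalized)  # "e-mail" and "email" end up as the same term
```

The same expression might work as the `pattern` setting of the "pattern" tokenizer, but that is untested here, and regex details such as Unicode handling of `\w` can differ between Python and Java.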

(Jörg Prante) #2

I wrote a hyphen tokenizer for this, for example https://github.com/jprante/elasticsearch-plugin-bundle/blob/master/src/test/java/org/xbib/elasticsearch/index/analysis/hyphen/HyphenTokenizerTests.java

The tokenizer is bundled with all my other analyzers/tokenizers/token filters in a plugin, see

(David Steiner) #3

I'm using Elasticsearch 5.0 (and I'm new to Elasticsearch). Will this work with 5.0?

(Jörg Prante) #4

I see. I need to update my project.

(David Steiner) #5

I'm assuming your reply means that what you have here will not work with 5.0 as it is?
