Handling hyphenated words like "e-mail" with tokenizer/stemmer

Ideally, I would like to stem "e-mail" to "email" so that a search in the stemmed field finds it whichever way someone typed it. Currently, the standard tokenizer appears to split "e-mail" into "e" and "mail". That isn't good: the stemmer never sees "e-mail" as a single token, and "mail" and "e-mail" end up matching each other. I can't use the "whitespace" tokenizer either, because it leaves other punctuation attached to the tokens. For example, in "He said, 'send me an e-mail!'", "send" is indexed as "'send".
So I'd like to tokenize in the standard way, except that hyphens are retained so that I have a chance to stem hyphenated words. Is there a way to do this? If it involves the "pattern" tokenizer, I'm hoping there's already a known pattern for this.
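
To make the desired behaviour concrete, here is a minimal sketch in plain Python. It is not Elasticsearch code: the regular expression and the function names `tokenize_keep_hyphens` and `fold_hyphens` are made up for illustration, and it assumes ASCII letters and digits only. It just shows the two steps described above: split on punctuation while keeping hyphens inside words, then fold the hyphen away so both spellings become the same term.

```python
import re

# Hypothetical pattern: runs of ASCII letters/digits, optionally joined by
# single hyphens. All other punctuation acts as a separator.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize_keep_hyphens(text):
    """Lowercase the text and return tokens, keeping hyphens inside words."""
    return TOKEN_RE.findall(text.lower())


def fold_hyphens(tokens):
    """Drop hyphens so that 'e-mail' and 'email' become the same term."""
    return [token.replace("-", "") for token in tokens]


tokens = tokenize_keep_hyphens("He said, 'send me an e-mail!'")
print(tokens)                # ['he', 'said', 'send', 'me', 'an', 'e-mail']
print(fold_hyphens(tokens))  # ['he', 'said', 'send', 'me', 'an', 'email']
```

Whether the same regular expression could be dropped into the "pattern" tokenizer unchanged is something I haven't checked.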

I wrote a hyphen tokenizer for this. For an example, see the tests: https://github.com/jprante/elasticsearch-plugin-bundle/blob/master/src/test/java/org/xbib/elasticsearch/index/analysis/hyphen/HyphenTokenizerTests.java

The tokenizer is bundled with all my other analyzers/tokenizers/token filters in a plugin, see

I'm using Elasticsearch 5.0 (and I'm new to Elasticsearch). Will this work with 5.0?

I see. I need to update my project.

I'm assuming your reply means that what you have here will not work with 5.0 as it is?
