Handling hyphenated words like "e-mail" with tokenizer/stemmer

David_Steiner · November 16, 2016, 7:24pm

Ideally, I would like to stem "e-mail" to "email" so that a search in the stemmed field finds it whichever way someone typed it. Currently, the standard tokenizer appears to split "e-mail" into "e" and "mail", which isn't good: I can no longer stem "e-mail", and "mail" and "e-mail" now end up sharing the token "mail". I can't use the "whitespace" tokenizer either, because if the text I'm indexing contains other punctuation, such as "He said, 'send me an e-mail!'", "send" isn't indexed correctly; it's indexed as "'send".
So I'd like to tokenize in the standard way, except with hyphens retained so that I have a chance to stem hyphenated words. Is there a way to do this? If it involves the "pattern" tokenizer, I'm hoping there's already a known pattern for this.
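
To make the desired behavior concrete, here is a minimal Python sketch of the tokenization described above. It is not Elasticsearch code: the regular expression and the helper names `tokenize_keep_hyphens` and `collapse_hyphens` are assumptions made up for illustration, not a known pattern for the "pattern" tokenizer. It only shows that splitting on punctuation while keeping hyphens between word characters leaves "e-mail" intact, so a later step can turn it into "email".

```python
import re

# Assumed pattern (hypothetical, for illustration only): a run of word
# characters, optionally followed by more runs joined by single hyphens.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*")


def tokenize_keep_hyphens(text):
    """Return lowercase tokens, keeping hyphens that sit between word characters."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def collapse_hyphens(token):
    """Normalize a hyphenated token by dropping its hyphens ('e-mail' -> 'email')."""
    return token.replace("-", "")


tokens = tokenize_keep_hyphens("He said, 'send me an e-mail!'")
normalized = [collapse_hyphens(t) for t in tokens]

print(tokens)
print(normalized)
```

Note that collapsing every hyphen is a blunt choice: it would also turn a word like "well-known" into "wellknown", so a real analyzer might need a list of specific words to normalize instead.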

jprante · November 16, 2016, 9:08pm

I wrote a hyphen tokenizer for this, for example https://github.com/jprante/elasticsearch-plugin-bundle/blob/master/src/test/java/org/xbib/elasticsearch/index/analysis/hyphen/HyphenTokenizerTests.java

The tokenizer is bundled with all my other analyzers/tokenizers/token filters in a plugin, see

David_Steiner · November 16, 2016, 9:32pm

I'm using Elasticsearch 5.0 (and I'm new to Elasticsearch). Will this work with 5.0?

jprante · November 16, 2016, 9:34pm

I see. I need to update my project.

David_Steiner · November 29, 2016, 8:52pm

I'm assuming your reply means that what you have here will not work with 5.0 as it is?

system · December 27, 2016, 8:53pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
