I am using Elasticsearch 2.0 and would like to be able to search for hyphenated terms using a combination of different queries. For example, for the word anti-emetic it should be possible to search with any of "anti-emetic", "anti emetic", or "antiemetic".
The tricky bit is the "anti emetic" case, since it looks like two different tokens. You can get partially there by normalizing the hyphen case to either the "split" or "merged" case. For example:
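A minimal sketch of index settings along these lines, assuming a string field foo with a sub-field foo.merged. The index name test, the type name doc, and the names hyphen_merge and hyphen_filter are placeholders, and the exact word_delimiter options are an assumption rather than a tested configuration:

```
# Sketch only: "test", "doc", "hyphen_merge" and "hyphen_filter" are
# placeholder names; the word_delimiter options are an assumption.
PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "hyphen_filter": {
          "type": "word_delimiter",
          "generate_word_parts": false,
          "generate_number_parts": false,
          "catenate_words": true,
          "split_on_case_change": false
        }
      },
      "analyzer": {
        "hyphen_merge": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "hyphen_filter"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "foo": {
          "type": "string",
          "fields": {
            "merged": { "type": "string", "analyzer": "hyphen_merge" }
          }
        }
      }
    }
  }
}
```

The whitespace tokenizer matters here: the standard tokenizer would split "anti-emetic" at the hyphen before word_delimiter ever saw it. With this sketch, foo (standard analyzer) should index "anti-emetic" as the tokens anti and emetic, while foo.merged should index it as the single token antiemetic.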
That creates an analyzer that merges hyphenated words into a single token using the word_delimiter token filter. This makes the hyphenated search work fine: a search for "anti-emetic" will find all three variations.
The problem is the other two queries. A search for "anti emetic" will only find ["anti-emetic", "anti emetic"], while a search for "antiemetic" will only find ["anti-emetic", "antiemetic"].
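To see why, it may help to look at a query that searches both the split and the merged form. The following is a sketch that assumes a hypothetical index test in which foo uses the standard analyzer and foo.merged uses a hyphen-merging analyzer:

```
# Sketch only: assumes a hypothetical "test" index where foo is analyzed
# with the standard analyzer and foo.merged with a hyphen-merging analyzer.
GET /test/_search
{
  "query": {
    "multi_match": {
      "query": "anti-emetic",
      "fields": ["foo", "foo.merged"]
    }
  }
}
```

Under those assumptions, the query "anti-emetic" becomes anti and emetic on foo and antiemetic on foo.merged, so between the two fields it reaches every variation. The query "anti emetic" becomes anti and emetic on both fields and never produces the merged token, so it misses documents containing only "antiemetic". The query "antiemetic" is a single token on both fields and never produces the split tokens, so it misses documents containing only "anti emetic".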
I don't think there is a good resolution to this. If there is a small list of prefixes like this, you could use a char_filter to normalize the text to the hyphenated form, or perhaps use a synonym list. Or, if there is a case change (AntiEmetic), word_delimiter can normalize based on that.
But if they are just two different tokens, it's hard for Elasticsearch to know that those tokens are "special" and should be merged, while other random tokens are not.
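For the small-list case, one possible shape is a mapping char_filter that rewrites known variants to the hyphenated form before tokenization. This is only a sketch: the names known_terms and term_normalized and the single-term mapping list are illustrative assumptions, and a real list would need one entry per variant of each term:

```
# Sketch only: "known_terms", "term_normalized" and the mapping entries
# are illustrative assumptions, not a tested configuration.
PUT /test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "known_terms": {
          "type": "mapping",
          "mappings": [
            "anti emetic => anti-emetic",
            "antiemetic => anti-emetic"
          ]
        }
      },
      "analyzer": {
        "term_normalized": {
          "type": "custom",
          "char_filter": ["known_terms"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

If the same analyzer is used at index time and at search time, all three spellings should end up as the tokens anti and emetic and therefore match one another. Note that char filters run before the lowercase filter, so as written the mappings are case-sensitive, and the approach only covers terms that are explicitly listed.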
This is brilliant, many thanks. I've got it working; however, I'm having issues with duplicate results from the highlighter. For example, searching for "antiemetic" highlights the term in foo as well as in foo.merged. Is there a way I can deduplicate them to show only one?