Searching for hyphenated terms with and without spaces or hyphens

I am using Elasticsearch 2.0 and would like to be able to search for hyphenated terms using a combination of different queries. For example, for the word anti-emetic it should be possible to search with any of:

  • anti-emetic
  • antiemetic
  • anti emetic

What's the best way to achieve this?

The tricky bit is the "anti emetic" case, since it looks like two different tokens. You can get partway there by normalizing the hyphenated form to either the "split" or the "merged" form. For example:

PUT /test
{
   "settings": {
      "analysis": {
         "analyzer": {
            "merged_hyphens": {
               "tokenizer": "whitespace",
               "filter": [ "lowercase", "word_delim"]
            }
         },
         "filter": {
            "word_delim": {
               "type": "word_delimiter",
               "catenate_words": true,
               "generate_word_parts": false,
               "generate_number_parts": false
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "foo": {
               "type": "string",
               "fields": {
                  "merged": {
                     "type": "string",
                     "analyzer": "merged_hyphens"
                  }
               }
            }
         }
      }
   }
}

PUT /test/test/1
{
    "foo": "anti-emetic"
}
PUT /test/test/2
{
    "foo": "antiemetic"
}
PUT /test/test/3
{
    "foo": "anti emetic"
}

GET /test/_search
{
    "query": {
        "multi_match": {
           "query": "anti-emetic",
           "fields": ["foo", "foo.merged"]
        }
    }
}

That creates an analyzer which uses the word_delimiter filter to merge hyphenated words into a single token, so hyphenated search works fine: a search for "anti-emetic" will find all three variations.
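You can confirm what the custom analyzer emits with the _analyze API (ES 2.x accepts the query-string form shown here):

GET /test/_analyze?analyzer=merged_hyphens&text=anti-emetic

This should return the single token antiemetic. Combined with the standard-analyzed foo field (where "anti-emetic" becomes the two tokens anti and emetic), the multi_match above reaches all three documents.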

The problem is the other two. A search for "anti emetic" will only find ["anti-emetic", "anti emetic"], while a search for "antiemetic" will only find ["anti-emetic", "antiemetic"].
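The reason is that word_delimiter only catenates parts within a single token, and the whitespace tokenizer has already split "anti emetic" into two tokens before the filter runs, so there is nothing for it to join. You can verify this with _analyze (the text parameter is URL-encoded):

GET /test/_analyze?analyzer=merged_hyphens&text=anti%20emetic

This returns the two separate tokens anti and emetic rather than one merged token.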

I don't think there is a clean general solution. If there is a small list of prefixes like this, you could use a char_filter to normalize the text to the hyphenated form, or perhaps a synonym list. Or, if there is a case change (AntiEmetic), word_delimiter can split on that.
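For instance, if you can enumerate the problematic pairs, a synonym filter can collapse all three variants into a single token at both index and search time. A minimal sketch (the index name, analyzer name, and synonym list here are illustrative, not a drop-in solution):

PUT /test_synonyms
{
   "settings": {
      "analysis": {
         "filter": {
            "merge_variants": {
               "type": "synonym",
               "synonyms": [
                  "anti emetic => antiemetic",
                  "anti-emetic => antiemetic"
               ]
            }
         },
         "analyzer": {
            "normalize_variants": {
               "tokenizer": "whitespace",
               "filter": [ "lowercase", "merge_variants" ]
            }
         }
      }
   }
}

With that analyzer applied to the field, "anti-emetic", "anti emetic", and "antiemetic" all reduce to the token antiemetic, so any of the three queries finds all three documents. The catch is that you have to maintain the list.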

But if they are just two different tokens, it's hard for Elasticsearch to know that those tokens are "special" and should be merged, while other random tokens are not.


This is brilliant, many thanks. I've got it working; however, I'm having issues with duplicate results from the highlighter. For example, searching for "antiemetic" highlights the term in foo as well as in foo.merged. Is there a way I can deduplicate them to show only one?
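For reference, this is the shape of the request I'm running (field names as in the example above, with an illustrative highlight block):

GET /test/_search
{
    "query": {
        "multi_match": {
           "query": "antiemetic",
           "fields": ["foo", "foo.merged"]
        }
    },
    "highlight": {
        "fields": {
           "foo": {},
           "foo.merged": {}
        }
    }
}

This returns a highlight entry for both foo and foo.merged on the same document.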

Along with @polyfractal's answer, you can refer to this link to achieve searching for words with and without spaces.