Searching for hyphenated terms with and without spaces or hyphens

I am using Elasticsearch 2.0 and would like to be able to search for hyphenated terms using a combination of different queries. For example, for the word anti-emetic it should be possible to search with any of:

  • anti-emetic
  • antiemetic
  • anti emetic

What's the best way to achieve this?

The tricky bit is the "anti emetic" case, since it looks like two different tokens. You can get partway there by normalizing the hyphenated form to either the "split" or the "merged" form. For example:

PUT /test
{
   "settings": {
      "analysis": {
         "analyzer": {
            "merged_hyphens": {
               "tokenizer": "whitespace",
               "filter": [ "lowercase", "word_delim"]
            }
         },
         "filter": {
            "word_delim": {
               "type": "word_delimiter",
               "catenate_words": true,
               "generate_word_parts": false,
               "generate_number_parts": false
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "foo": {
               "type": "string",
               "fields": {
                  "merged": {
                     "type": "string",
                     "analyzer": "merged_hyphens"
                  }
               }
            }
         }
      }
   }
}

PUT /test/test/1
{
    "foo": "anti-emetic"
}
PUT /test/test/2
{
    "foo": "antiemetic"
}
PUT /test/test/3
{
    "foo": "anti emetic"
}

GET /test/_search
{
    "query": {
        "multi_match": {
           "query": "anti-emetic",
           "fields": ["foo", "foo.merged"]
        }
    }
}

That creates an analyzer which uses the word_delimiter filter to merge hyphenated words into a single token, so hyphenated search works fine: a search for "anti-emetic" will find all three variations.
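You can confirm what the custom analyzer emits with the _analyze API (ES 2.x accepts the query-string form shown here):

GET /test/_analyze?analyzer=merged_hyphens&text=anti-emetic

This should return the single token antiemetic. Combined with the standard-analyzed foo field (where "anti-emetic" becomes the two tokens anti and emetic), the multi_match above reaches all three documents.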

The problem is the other two. A search for "anti emetic" will only find ["anti-emetic", "anti emetic"], while a search for "antiemetic" will only find ["anti-emetic", "antiemetic"].
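The reason is that word_delimiter only catenates parts within a single token, and the whitespace tokenizer has already split "anti emetic" into two tokens before the filter runs, so there is nothing for it to join. You can verify this with _analyze (the text parameter is URL-encoded):

GET /test/_analyze?analyzer=merged_hyphens&text=anti%20emetic

This returns the two separate tokens anti and emetic rather than one merged token.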

I don't think there is a clean general solution. If there is a small list of prefixes like this, you could use a char_filter to normalize the text to the hyphenated form, or perhaps a synonym list. Or, if there is a case change (AntiEmetic), word_delimiter can split on that.
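For instance, if you can enumerate the problematic pairs, a synonym filter can collapse all three variants into a single token at both index and search time. A minimal sketch (the index name, analyzer name, and synonym list here are illustrative, not a drop-in solution):

PUT /test_synonyms
{
   "settings": {
      "analysis": {
         "filter": {
            "merge_variants": {
               "type": "synonym",
               "synonyms": [
                  "anti emetic => antiemetic",
                  "anti-emetic => antiemetic"
               ]
            }
         },
         "analyzer": {
            "normalize_variants": {
               "tokenizer": "whitespace",
               "filter": [ "lowercase", "merge_variants" ]
            }
         }
      }
   }
}

With that analyzer applied to the field, "anti-emetic", "anti emetic", and "antiemetic" all reduce to the token antiemetic, so any of the three queries finds all three documents. The catch is that you have to maintain the list.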

But if they are just two different tokens, it's hard for Elasticsearch to know that those tokens are "special" and should be merged, while other random tokens are not.


This is brilliant, many thanks. I've got it working; however, I'm having issues with duplicate results from the highlighter. For example, searching for "antiemetic" highlights the term in foo as well as in foo.merged. Is there a way I can deduplicate them to show only one?
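For reference, this is the shape of the request I'm running (field names as in the example above, with an illustrative highlight block):

GET /test/_search
{
    "query": {
        "multi_match": {
           "query": "antiemetic",
           "fields": ["foo", "foo.merged"]
        }
    },
    "highlight": {
        "fields": {
           "foo": {},
           "foo.merged": {}
        }
    }
}

This returns a highlight entry for both foo and foo.merged on the same document.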

Along with @polyfractal's answer, you can refer to this link to achieve searching for words with and without spaces.