Searching for exactly a hyphenated word

So this problem has stumped me.

I would like to search for instances of (for example) "hyper-space" but not find "hyper space". Every traditional tokenizer except the whitespace tokenizer will split on the hyphen, and the hyphen is lost to the ether as far as indexing goes. I can use a whitespace tokenizer and then apply a word delimiter token filter with the 'preserve original' setting. That gets "hyper-space", "hyper" and "space" all indexed, which is great. Now I can run a term query for "hyper-space" and get precisely what I need.
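
The setup I have in mind looks roughly like this (the index, filter and analyzer names are just placeholders):

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "preserve_original": true
        }
      },
      "analyzer": {
        "my_whitespace_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "my_word_delimiter",
            "lowercase"
          ]
        }
      }
    }
  }
}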

But. . .

Since I'm using a whitespace tokenizer with "preserve original", the preserved original may also carry trailing punctuation, like "hyper-space,". Now the term query won't match because of that trailing comma. Yuck.
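
To make the problem concrete, analyzing a sentence with that placeholder analyzer should show the preserved original keeping the comma:

GET my_index/_analyze
{
  "analyzer": "my_whitespace_analyzer",
  "text": "I love hyper-space, truly"
}

# Expected tokens include "hyper-space," (with the trailing comma),
# plus "hyper" and "space", so a term query for "hyper-space" misses it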

All the ways I can see of getting around this involve expensive processing in Elasticsearch that I would prefer to avoid, things like filtering by character or using regexp. Is there a more intuitive solution to this problem that I am missing?

Thanks
David

You can use an analyzer with a mapping character filter that replaces any dashes with a character that is not removed by the tokenizer, for example an underscore.

For example, given this index:

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "- => _"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

With this setup, you will only find a document containing hyper-space if you search for hyper-space with the dash:

# Test the analyzer
GET my_index/_analyze
{
  "text": "foo hyper-space bar",
  "analyzer": "my_analyzer"
}
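
# Expected tokens are roughly "foo", "hyper_space" and "bar", because the
# standard tokenizer does not split on the underscore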

# Index a document containing "hyper-space"
PUT my_index/_doc/1
{
  "my_field": "foo hyper-space bar"
}

# A query for just "hyper" does not return any hits
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "hyper"
    }
  }
}

# A query for "hyper space" (without a dash) does not return any hits either
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "hyper space"
    }
  }
}

# A query for "hyper-space" with a dash does return our document
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "hyper-space"
    }
  }
}

Thanks! I considered something like this, but I was wondering how expensive a character filter is. I've never included one because it feels like pretty heavy pre-processing.
