Searching for exactly a hyphenated word

Daverino · December 17, 2018, 10:03pm

So this problem has stumped me.

I would like to search for instances of (for example) "hyper-space" but not find "hyper space". All traditional tokenizers except whitespace are going to split on the hyphen, so it is lost as far as indexing goes. I can use a whitespace tokenizer and then apply a word_delimiter token filter with the preserve_original setting. That gets "hyper-space", "hyper" and "space" all indexed, which is great. Now I can run a term query for "hyper-space" and I get precisely what I need.

But...

Since I'm using a whitespace tokenizer and preserve_original, the preserved original token may also carry trailing punctuation, like "hyper-space,". Now the term query won't match because of that trailing comma. Yuck.
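The behavior described above can be reproduced with the _analyze API. This is a minimal sketch, assuming an inline word_delimiter filter definition with preserve_original enabled; the sample text is made up for illustration:

```json
# Whitespace tokenizer plus word_delimiter, keeping the original token
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter",
      "preserve_original": true
    }
  ],
  "text": "foo hyper-space, bar"
}
```

The token list should include "hyper-space," with the comma still attached, alongside "hyper" and "space", which is why a term query for "hyper-space" finds nothing.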

All the possible ways of getting around this seem to involve expensive processing in Elasticsearch that I would prefer to avoid, such as filtering by character or using regexp. Is there a more intuitive solution to this problem that I am missing?

Thanks
David

abdon · December 18, 2018, 10:57am

You can use an analyzer with a mapping character filter that replaces any dashes with a character that is not removed by the tokenizer, for example an underscore.

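For example, given this index (6.x syntax; from 7.0 on, mapping types are removed, so "properties" goes directly under "mappings"):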

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "- => _"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

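Both the indexed text and the query text are now analyzed to the single token hyper_space, because the standard tokenizer does not split on underscores. As a result, you can only find a document containing hyper-space if you search for hyper-space with a dash: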

# Test the analyzer
GET my_index/_analyze
{
  "text": "foo hyper-space bar",
  "analyzer": "my_analyzer"
}

# Index a document containing "hyper-space"
PUT my_index/_doc/1
{
  "my_field": "foo hyper-space bar"
}

# A query for just "hyper" does not return any hits
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "hyper"
    }
  }
}

# A query for "hyper space" (without a dash) does not return any hits either
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "hyper space"
    }
  }
}

# A query for "hyper-space" with a dash does return our document
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "hyper-space"
    }
  }
}
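The same analyzer also handles the trailing punctuation from the original question, because the standard tokenizer drops the comma after the character filter has run. A quick check, assuming the my_index and my_analyzer definitions above (the sample text is only illustrative):

```json
# Analyze text where the hyphenated word is followed by a comma
GET my_index/_analyze
{
  "text": "foo hyper-space, bar",
  "analyzer": "my_analyzer"
}
```

This should return the tokens foo, hyper_space and bar, with no comma attached. One side effect to be aware of: a literal hyper_space in the text produces the same token, so it would match a search for hyper-space as well.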

Daverino · December 18, 2018, 3:50pm

Thanks! I considered something like this, but I was wondering how expensive a character filter is. I've never included one because it feels like pretty heavy pre-processing.

system · January 15, 2019, 3:50pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
