Compound word token filter: only_longest_match doesn't work as expected in some scenarios

The only_longest_match option may not quite work how you think it does. This Lucene issue comment has details on how it works:

The onlyLongestMatch flag currently affects whether all matches or only the longest match should be returned per start character (in DictionaryCompoundWordTokenFilter) or per hyphenation start point (in HyphenationCompoundWordTokenFilter).

Example:
Dictionary "Schaft", "Wirt", "Wirtschaft", "Wissen", "Wissenschaft" for input "Wirtschaftswissenschaft" will return the original input plus tokens "Wirtschaft", "schaft", "wissenschaft", "schaft" but not "Wirt" or "Wissen". "schaft" is still returned (even twice) because it's the longest token starting at the respective position.

So, I don't think only_longest_match is the way to go here.

One way to prevent certain words from being decompounded is to map them to "placeholder tokens" that will not be decompounded. The mapping character filter can be used for that. For example, you could create your index like this:

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "lightbulb => DO_NOT_DECOMPOUND_1",
            "starlight => DO_NOT_DECOMPOUND_2"
          ]
        }
      },
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": [
            "light"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_decompounder"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

The character filter prevents lightbulb and starlight from being decompounded by replacing those words with DO_NOT_DECOMPOUND_1 and DO_NOT_DECOMPOUND_2 before tokenization. (Note that the mapping character filter matches case-sensitively, so if your input can contain variants like Lightbulb you would need additional mappings for them.) You can see how this works by testing the my_analyzer analyzer on lighthouse and lightbulb:

GET my_index/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "lighthouse"
}

GET my_index/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "lightbulb"
}

You will see that lighthouse produces the tokens lighthouse and light, while lightbulb produces only do_not_decompound_1: the lowercase filter lowercases the placeholder, and since that token contains no word from word_list, the decompounder leaves it alone. And you can test that it works as desired in queries like this:

PUT my_index/_doc/1
{
  "my_field": "lightbulb"
}

PUT my_index/_doc/2
{
  "my_field": "lighthouse"
}

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "light"
    }
  }
}
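This search should return only document 2 (lighthouse), because only that document was indexed with a light token. Searching for lightbulb itself should still work, because the match query analyzes the query string with the field's analyzer, so it is mapped to the same placeholder token that was indexed for document 1 (a sketch along the same lines as the query above):

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "lightbulb"
    }
  }
}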