Compound word token filter: only_longest_match doesn't work as expected in some scenarios

The only_longest_match option may not quite work how you think it does. This Lucene issue comment has details on how it works:

The onlyLongestMatch flag currently affects whether all matches or only the longest match should be returned per start character (in DictionaryCompoundWordTokenFilter) or per hyphenation start point (in HyphenationCompoundWordTokenFilter).

Example:
Dictionary "Schaft", "Wirt", "Wirtschaft", "Wissen", "Wissenschaft" for input "Wirtschaftswissenschaft" will return the original input plus tokens "Wirtschaft", "schaft", "wissenschaft", "schaft" but not "Wirt" or "Wissen". "schaft" is still returned (even twice) because it's the longest token starting at the respective position.

So, I don't think only_longest_match is the way to go here.

One way to prevent certain words from being decompounded is to map them to "placeholder tokens" that will not be decompounded. The mapping character filter can be used for that. For example, you could create your index like this:

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "lightbulb => DO_NOT_DECOMPOUND_1",
            "starlight => DO_NOT_DECOMPOUND_2"
          ]
        }
      },
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": [
            "light"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_decompounder"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

The character filter prevents lightbulb and starlight from being decompounded by replacing those words with DO_NOT_DECOMPOUND_1 and DO_NOT_DECOMPOUND_2 before tokenization. (Note that the mapping character filter matches case-sensitively, so if your input can contain variants like Lightbulb you would need additional mappings for them.) You can see how this works by testing the my_analyzer analyzer on lighthouse and lightbulb:

GET my_index/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "lighthouse"
}

GET my_index/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "lightbulb"
}

You will see that lighthouse produces the tokens lighthouse and light, while lightbulb produces only do_not_decompound_1: the lowercase filter lowercases the placeholder, and since that token contains no word from word_list, the decompounder leaves it alone. And you can test that it works as desired in queries like this:

PUT my_index/_doc/1
{
  "my_field": "lightbulb"
}

PUT my_index/_doc/2
{
  "my_field": "lighthouse"
}

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "light"
    }
  }
}
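This search should return only document 2 (lighthouse), because only that document was indexed with a light token. Searching for lightbulb itself should still work, because the match query analyzes the query string with the field's analyzer, so it is mapped to the same placeholder token that was indexed for document 1 (a sketch along the same lines as the query above):

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "lightbulb"
    }
  }
}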