Compound word token filter: only_longest_match doesn't work as expected in some scenarios

Hi,

I work on a German product search. To make things clearer, I have "translated" my problem into the following case:

Let's assume I have four documents in my index, each containing one of the following words:

"starlight"
"moonlight"
"lighthouse"
"lightbulb"

If I now search for "light", I of course won't find anything.

So I define two (identical) analyzers that are used on the respective field for indexing and search. These analyzers consist only of a "dictionary_decompounder" token filter with "word_list": ["light"].

Now, of course, I find all four documents. If I then want to exclude "lightbulb" and "starlight" from the "light" search results, I add those two words to the list, "word_list": ["light", "lightbulb", "starlight"], and activate "only_longest_match": true.
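
For reference, this is roughly the setup I'm using (the index, analyzer, and field names are just placeholders for this example):

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["light", "lightbulb", "starlight"],
          "only_longest_match": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_decompounder"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}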

Now I would expect only the "moonlight" and "lighthouse" documents if I search for "light" again.

Strangely, "lightbulb" now disappears as desired, but I still get the "starlight" document back (in addition to the expected "moonlight" and "lighthouse" documents).

This seems to happen because in "lightbulb" the "light" is at the start of the word, whereas in "starlight" it's at the end.
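
Running the analyzer directly on the two words (with the setup sketched above) seems to confirm this:

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "starlight lightbulb"
}

The response still contains a light subtoken for starlight, while lightbulb produces no light token.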

Does anyone have any idea how I can ensure that "starlight" doesn't show up when I search for "light" in this scenario?

Thanks

The only_longest_match option may not quite work the way you think it does. This Lucene issue comment has the details:

The onlyLongestMatch flag currently affects whether all matches or only the longest match should be returned per start character (in DictionaryCompoundWordTokenFilter) or per hyphenation start point (in HyphenationCompoundWordTokenFilter).

Example:
Dictionary "Schaft", "Wirt", "Wirtschaft", "Wissen", "Wissenschaft" for input "Wirtschaftswissenschaft" will return the original input plus tokens "Wirtschaft", "schaft", "wissenschaft", "schaft" but not "Wirt" or "Wissen". "schaft" is still returned (even twice) because it's the longest token starting at the respective position.

So, I don't think only_longest_match is the way to go here.

One way to prevent certain words from being decompounded is to map those words to "placeholder tokens" that do not get decompounded themselves. The mapping character filter can be used for that. For example, you could create your index like this:

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "lightbulb => DO_NOT_DECOMPOUND_1",
            "starlight => DO_NOT_DECOMPOUND_2"
          ]
        }
      },
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": [
            "light"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_decompounder"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

The character filter prevents lightbulb and starlight from being decompounded by replacing those words with DO_NOT_DECOMPOUND_1 and DO_NOT_DECOMPOUND_2 before the text is tokenized. You can see how this works by testing the my_analyzer analyzer on lighthouse and lightbulb:

GET my_index/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "lighthouse"
}

GET my_index/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "lightbulb"
}

You will see that lighthouse does get a light token, but lightbulb does not. Abridged to just the token values, the two responses should look something like this:
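
{
  "tokens": [
    { "token": "lighthouse", ... },
    { "token": "light", ... }
  ]
}

{
  "tokens": [
    { "token": "do_not_decompound_1", ... }
  ]
}

Note that lightbulb is indexed as the lowercased placeholder token, which is exactly what keeps it out of the light results. And you can test that it works as desired in queries like this: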

PUT my_index/_doc/1
{
  "my_field": "lightbulb"
}

PUT my_index/_doc/2
{
  "my_field": "lighthouse"
}

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "light"
    }
  }
}
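
Since the query string light is analyzed with my_analyzer as well, it only matches the light token produced for lighthouse. Abridged, the response should look something like this, with document 2 as the only hit:

{
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "hits": [
      { "_index": "my_index", "_id": "2", "_source": { "my_field": "lighthouse" } }
    ]
  }
}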