Compound word token filter: only_longest_match doesn't work as expected in some scenarios

Hi,

I work on a German product search. To make things clearer, I have "translated" my problem into the following case:

Let's assume I have four documents in my index, each containing one of the following words:

"starlight"
"moonlight"
"lighthouse"
"lightbulb"

If I now search for "light", I of course won't find anything.

So I define two (identical) analyzers that are used on the respective field for indexing and search. These analyzers consist only of a "dictionary_decompounder" token filter with "word_list": ["light"].

Now, of course, I find all four documents. If I then want to exclude "lightbulb" and "starlight" from the "light" search results, I add those two words to the list, "word_list": ["light", "lightbulb", "starlight"], and activate "only_longest_match": true.
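
For reference, this is roughly the setup I'm using (the index, analyzer, and field names are just placeholders for this example):

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["light", "lightbulb", "starlight"],
          "only_longest_match": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_decompounder"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}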

Now I would expect only the "moonlight" and "lighthouse" documents if I search for "light" again.

Strangely, "lightbulb" now disappears as desired, but I still get the "starlight" document back (in addition to the expected "moonlight" and "lighthouse" documents).

This seems to happen because in "lightbulb" the "light" is at the start of the word, whereas in "starlight" it's at the end.
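
Running the analyzer directly on the two words (with the setup sketched above) seems to confirm this:

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "starlight lightbulb"
}

The response still contains a light subtoken for starlight, while lightbulb produces no light token.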

Does anyone have any idea how I can ensure that "starlight" doesn't show up when I search for "light" in this scenario?

Thanks

The only_longest_match option may not quite work the way you think it does. This Lucene issue comment has the details:

The onlyLongestMatch flag currently affects whether all matches or only the longest match should be returned per start character (in DictionaryCompoundWordTokenFilter) or per hyphenation start point (in HyphenationCompoundWordTokenFilter).

Example:
Dictionary "Schaft", "Wirt", "Wirtschaft", "Wissen", "Wissenschaft" for input "Wirtschaftswissenschaft" will return the original input plus tokens "Wirtschaft", "schaft", "wissenschaft", "schaft" but not "Wirt" or "Wissen". "schaft" is still returned (even twice) because it's the longest token starting at the respective position.

So, I don't think only_longest_match is the way to go here.

One way to prevent certain words from being decompounded is to map those words to "placeholder tokens" that do not get decompounded themselves. The mapping character filter can be used for that. For example, you could create your index like this:

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "lightbulb => DO_NOT_DECOMPOUND_1",
            "starlight => DO_NOT_DECOMPOUND_2"
          ]
        }
      },
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": [
            "light"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_decompounder"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

The character filter prevents lightbulb and starlight from being decompounded by replacing those words with DO_NOT_DECOMPOUND_1 and DO_NOT_DECOMPOUND_2 before the text is tokenized. You can see how this works by testing the my_analyzer analyzer on lighthouse and lightbulb:

GET my_index/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "lighthouse"
}

GET my_index/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "lightbulb"
}

You will see that lighthouse does get a light token, but lightbulb does not. Abridged to just the token values, the two responses should look something like this:
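
{
  "tokens": [
    { "token": "lighthouse", ... },
    { "token": "light", ... }
  ]
}

{
  "tokens": [
    { "token": "do_not_decompound_1", ... }
  ]
}

Note that lightbulb is indexed as the lowercased placeholder token, which is exactly what keeps it out of the light results. And you can test that it works as desired in queries like this: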

PUT my_index/_doc/1
{
  "my_field": "lightbulb"
}

PUT my_index/_doc/2
{
  "my_field": "lighthouse"
}

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "light"
    }
  }
}
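
Since the query string light is analyzed with my_analyzer as well, it only matches the light token produced for lighthouse. Abridged, the response should look something like this, with document 2 as the only hit:

{
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "hits": [
      { "_index": "my_index", "_id": "2", "_source": { "my_field": "lighthouse" } }
    ]
  }
}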