The only_longest_match option may not quite work how you think it does. This Lucene issue comment has details on how it works:
The onlyLongestMatch flag currently affects whether all matches or only the longest match should be returned per start character (in DictionaryCompoundWordTokenFilter) or per hyphenation start point (in HyphenationCompoundWordTokenFilter).
Example:
Dictionary "Schaft", "Wirt", "Wirtschaft", "Wissen", "Wissenschaft" for input "Wirtschaftswissenschaft" will return the original input plus tokens "Wirtschaft", "schaft", "wissenschaft", "schaft" but not "Wirt" or "Wissen". "schaft" is still returned (even twice) because it's the longest token starting at the respective position.
So, I don't think only_longest_match is the way to go here.
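If you want to see that behaviour for yourself, here is a quick sketch you could run against a scratch index (the index and analyzer names are just placeholders, and I'm lowercasing both the input and the word list so that case does not interfere with the dictionary matching):

PUT decompound_test
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["schaft", "wirt", "wirtschaft", "wissen", "wissenschaft"],
          "only_longest_match": true
        }
      },
      "analyzer": {
        "german_test": {
          "tokenizer": "standard",
          "filter": ["lowercase", "german_decompounder"]
        }
      }
    }
  }
}

GET decompound_test/_analyze
{
  "analyzer": "german_test",
  "text": "Wirtschaftswissenschaft"
}

If the behaviour described in the comment holds, the response should contain wirtschaft, schaft, wissenschaft and schaft (plus the original token), but neither wirt nor wissen.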
One way to prevent certain words from being decompounded is to map those words to "placeholder tokens" that do not get decompounded. The mapping character filter can be used for that. For example, you could create your index like this:
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "lightbulb => DO_NOT_DECOMPOUND_1",
            "starlight => DO_NOT_DECOMPOUND_2"
          ]
        }
      },
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": [
            "light"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_decompounder"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
The character filter will prevent lightbulb and starlight from being decompounded by replacing those words with DO_NOT_DECOMPOUND_1 and DO_NOT_DECOMPOUND_2 before the tokenizer runs. You can see how this works by testing the my_analyzer analyzer on lighthouse and lightbulb:
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "lighthouse"
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "lightbulb"
}
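For reference, the two responses should look roughly like this (token types and offsets are illustrative; note that the placeholder comes back lowercased because the lowercase filter runs after the char filter and tokenizer):

{
  "tokens": [
    { "token": "lighthouse", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 },
    { "token": "light", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 }
  ]
}

{
  "tokens": [
    { "token": "do_not_decompound_1", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 }
  ]
}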
You will see that lighthouse does get a light token, but lightbulb does not. And you can test that it works as desired in queries like this:
PUT my_index/_doc/1
{
  "my_field": "lightbulb"
}

PUT my_index/_doc/2
{
  "my_field": "lighthouse"
}

GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "light"
    }
  }
}
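Only the lighthouse document should come back from this query: its token stream contains light, whereas the lightbulb document was indexed only as the do_not_decompound_1 placeholder and so never produced a light token.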