We are evaluating the use of Elastic's "Compound Word Token Filters" and @jprante's "Decompound Plugin" for a large index of German documents.
So far both work fine; @jprante's plugin works even a little better.
The problem is that Elasticsearch's phrase query gives unwanted matches as soon as a decompounded token is involved. An example:
When we index "deutsche Spielbankgesellschaft", it is analyzed as follows:
GET our_german_index/_analyze
{
  "analyzer": "default",
  "text": "deutsche Spielbankgesellschaft"
}

{
  "tokens": [
    {
      "token": "deutsche",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "spielbankgesellschaft",
      "start_offset": 9,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "spiel",
      "start_offset": 9,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "bank",
      "start_offset": 9,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "gesellschaft",
      "start_offset": 9,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
This looks good. But note that the tokens "spielbankgesellschaft", "spiel", "bank" and "gesellschaft" are all at the same "position" (1).
Hence, the phrase query "deutsche bank" matches and returns the document, because "deutsche" is at position 0 and "bank" is at position 1. Technically this makes sense, but a user searching for the phrase "deutsche bank" would not expect a hit on "deutsche spielbankgesellschaft".
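To make the mechanism concrete, here is a minimal Python sketch of position-based phrase matching, using the tokens and positions from the analyzer output above. It is a simplification for illustration only, not Lucene's actual implementation; the helper names `build_positions` and `phrase_matches` are made up for this example, and the query is assumed to be split on whitespace without any decompounding.

```python
# Tokens as (term, position), copied from the _analyze output above.
TOKENS = [
    ("deutsche", 0),
    ("spielbankgesellschaft", 1),
    ("spiel", 1),
    ("bank", 1),
    ("gesellschaft", 1),
]


def build_positions(tokens):
    """Map each term to the set of positions at which it occurs."""
    index = {}
    for term, pos in tokens:
        index.setdefault(term, set()).add(pos)
    return index


def phrase_matches(index, phrase):
    """Simplified exact phrase match (slop 0).

    The i-th query term must occur at position start + i for some start
    position of the first term. Assumption: the query is only lowercased
    and split on whitespace.
    """
    terms = phrase.lower().split()
    if not terms:
        return False
    for start in index.get(terms[0], ()):
        if all(start + i in index.get(t, ()) for i, t in enumerate(terms)):
            return True
    return False


index = build_positions(TOKENS)
print(phrase_matches(index, "deutsche bank"))                   # True (the unwanted hit)
print(phrase_matches(index, "deutsche spielbankgesellschaft"))  # True (the expected hit)
print(phrase_matches(index, "spiel bank"))                      # False (both terms share position 1)
```

Because the sub-tokens are stacked on the position of the original compound, any one of them can stand in for the compound in a phrase, which is exactly the behavior described above.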
We are looking for a solution to this. In other words: whenever a phrase query is executed, the tokens generated by the Compound Word Token Filters should be ignored. In a normal match query it is fine, and in fact required, that "deutsche bank" also returns "deutsche spielbankgesellschaft".
Has anyone else had this problem? Is there a general solution available?