Phrase Query breaks with "Compound Word Token Filters"

We are evaluating the use of Elastic's "Compound Word Token Filters" and @jprante's "Decompound Plugin" for a large index of German documents.

So far both work fine. The one by @jprante even works a little better. :wink:

The problem is that Elasticsearch's phrase query breaks if a decompounded token is involved. An example:

When indexing "deutsche Spielbankgesellschaft", it is analyzed as follows:

GET our_german_index/_analyze
{
  "analyzer" : "default",
  "text" : "deutsche Spielbankgesellschaft"
}
  
{
  "tokens": [
    {
      "token": "deutsche",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "spielbankgesellschaft",
      "start_offset": 9,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "spiel",
      "start_offset": 9,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "bank",
      "start_offset": 9,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "gesellschaft",
      "start_offset": 9,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

This looks good. But notice: the tokens "spielbankgesellschaft", "spiel", "bank" and "gesellschaft" all share the same "position".

Hence, the phrase query "deutsche bank" matches and returns the document. Technically this makes sense, but a user searching for "deutsche bank" would not expect a hit on "deutsche Spielbankgesellschaft".

We are looking for a solution to this. In other words: whenever a phrase query is executed, the tokens generated by the compound word token filters should be ignored. In a normal match query it is fine, and even required, that 'deutsche bank' also returns 'deutsche spielbankgesellschaft'.

Has anyone had this problem? Is there a general solution available?

Does anyone have an idea? Are there any ES projects using decompounding?

I'm not sure I fully understand the problem, but maybe you simply shouldn't do a phrase match on a decompounded field?

Let's make this a bit more concrete with a full example (mapping, analyze, data, and query):

DELETE /test

PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "hyphenation_decompounder": {
          "type": "hyphenation_decompounder",
          "hyphenation_patterns_path": "hyph/de.xml",
          "word_list": [
            "spiel",
            "bank",
            "gesellschaft"
          ]
        }
      },
      "analyzer": {
        "my_lowercase_analyzer": {
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        },
        "my_hyphenation_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "hyphenation_decompounder"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "quote": {
          "type": "text",
          "fields": {
            "lowercase": {
              "type": "text",
              "analyzer": "my_lowercase_analyzer"
            },
            "hyphenation": {
              "type": "text",
              "analyzer": "my_hyphenation_analyzer"
            }
          }
        }
      }
    }
  }
}

GET /test/_analyze
{
  "analyzer" : "my_hyphenation_analyzer",
  "text" : "deutsche Spielbankgesellschaft"
}
POST /test/_doc
{
  "quote": "deutsche Spielbankgesellschaft"
}

GET /test/_search
{
  "query": {
    "match": {
      "quote.lowercase": "deutsche bank"
    }
  }
}
GET /test/_search
{
  "query": {
    "match_phrase": {
      "quote.lowercase": "deutsche bank"
    }
  }
}
GET /test/_search
{
  "query": {
    "match": {
      "quote.hyphenation": "deutsche bank"
    }
  }
}
GET /test/_search
{
  "query": {
    "match_phrase": {
      "quote.hyphenation": "deutsche bank"
    }
  }
}

The .lowercase field only finds the document with the match but not with the match_phrase. The .hyphenation field finds it with both.

Trying another example just looking for "bank":

GET /test/_search
{
  "query": {
    "match": {
      "quote.lowercase": "bank"
    }
  }
}
GET /test/_search
{
  "query": {
    "match": {
      "quote.hyphenation": "bank"
    }
  }
}

Only the .hyphenation field finds the document. So you will probably need a combination depending on what people are searching for.

This might be a total oversimplification, but maybe you want to implement something like this. If the user searches for:

  1. A single token:
    • Search on the .hyphenation field with a match.
    • Or maybe do a boolean query with should and a match on both .hyphenation and .lowercase — that would give "deutsche bank" more relevancy than "deutsche spielbankgesellschaft" when searching for "bank".
  2. Multiple tokens:
    • Search on the .lowercase field with a match_phrase.
    • Optionally add a suggestion if that might be too strict.
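For the single-token case, the combined should query could look like this sketch (reusing the test index and field names from the example above):

```
GET /test/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "quote.lowercase": "bank" } },
        { "match": { "quote.hyphenation": "bank" } }
      ]
    }
  }
}
```

A document containing the literal token "bank" matches both should clauses and therefore scores higher than one that only matches via decompounding, such as "deutsche Spielbankgesellschaft".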

Thanks a lot @xeraa for the detailed answer.

Yes, your solution of using two different fields for the same content might work in some cases. In our case it has some major drawbacks:

  • The index size increases significantly (we are talking about 800 million documents with large content like articles and ebooks)
  • Our application needs to use the query-string-query syntax, e.g.: 'Geld AND Boerse AND "Deutsche Bank"' --> In such a query we can't easily extract the phrase part and match it against a different field, right?

Index size: Do you need hyphenation on all fields? Though I assume that for the best search results you'll want a combination of lowercase, lowercase + stop words + stemming (+ synonyms), hyphenation,... at least for some fields.
Also, did you check what that actually means for storage? Since you are "only" adding more inverted indices, I'd be curious what you are using now and how much the increase would be.
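One way to measure the actual increase would be to compare the store size before and after adding the extra subfield, e.g. with the cat indices API (index name from the example above):

```
GET _cat/indices/test?v&h=index,docs.count,store.size
```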

Why query string? Couldn't your application rewrite the queries into a boolean query? I think that would give you more flexibility and features. You have some implicit knowledge when you create the query (what should use hyphenation and what not,...) and I think you'll need to make use of that to get better results.
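As a sketch (the field routing is just an assumption, reusing the fields from the example above): 'Geld AND Boerse AND "Deutsche Bank"' could be rewritten so that single terms go against the decompounded field while the phrase part goes against the exact field:

```
GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "quote.hyphenation": "Geld" } },
        { "match": { "quote.hyphenation": "Boerse" } },
        { "match_phrase": { "quote.lowercase": "Deutsche Bank" } }
      ]
    }
  }
}
```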

Thanks for your feedback. We will discuss your suggestion in our project-team. :grinning:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.