Phrase Query breaks with "Compound Word Token Filters"

mos · July 6, 2018, 8:47am

We are evaluating the use of Elastic's "Compound Word Token Filters" and @jprante's "Decompound Plugin" for a large index of German documents.

So far both work fine; @jprante's plugin works even a little better.

The problem is that Elasticsearch's phrase query breaks when a decompounded token is involved. An example:

When indexing "deutsche Spielbankgesellschaft", the text is analyzed as follows:

GET our_german_index/_analyze
{
  "analyzer" : "default",
  "text" : "deutsche Spielbankgesellschaft"
}
  
{
  "tokens": [
    {
      "token": "deutsche",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "spielbankgesellschaft",
      "start_offset": 9,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "spiel",
      "start_offset": 9,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "bank",
      "start_offset": 9,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "gesellschaft",
      "start_offset": 9,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

This looks good. But notice: The tokens "spielbankgesellschaft", "spiel", "bank" and "gesellschaft" are all at the same "position".

Hence, the phrase query "deutsche bank" matches and returns the document. Technically this makes sense, because "bank" sits at position 1, directly after "deutsche" at position 0. But a user searching for the phrase "deutsche bank" would not expect "deutsche spielbankgesellschaft" as a hit.
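
To see why the match happens, here is a minimal Python sketch of positional phrase matching. It is a simplification for illustration, not Lucene's actual implementation: the token list is copied from the _analyze output above, and `build_positions` and `phrase_matches` are hypothetical helper names.

```python
from collections import defaultdict

# Token stream copied from the _analyze output above: (token, position).
tokens = [
    ("deutsche", 0),
    ("spielbankgesellschaft", 1),
    ("spiel", 1),
    ("bank", 1),
    ("gesellschaft", 1),
]


def build_positions(token_stream):
    """Map each term to the set of positions it occupies."""
    positions = defaultdict(set)
    for term, pos in token_stream:
        positions[term].add(pos)
    return positions


def phrase_matches(positions, phrase_terms):
    """Exact phrase check (slop 0): term i must sit at position start + i."""
    for start in positions.get(phrase_terms[0], ()):
        if all(
            start + i in positions.get(term, ())
            for i, term in enumerate(phrase_terms)
        ):
            return True
    return False


index = build_positions(tokens)
print(phrase_matches(index, ["deutsche", "bank"]))                   # True
print(phrase_matches(index, ["deutsche", "spielbankgesellschaft"]))  # True
print(phrase_matches(index, ["bank", "deutsche"]))                   # False
```

The first check succeeds only because the sub-token "bank" shares position 1 with the full compound. Without the decompounded tokens in the stream, the same phrase would not match.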

We are looking for a solution to this. In other words: whenever a phrase query is executed, the tokens generated by the Compound Word Token Filters should be ignored. In a normal match query it is fine, and in fact required, that 'deutsche bank' also returns 'deutsche spielbankgesellschaft'.

Has anyone had this problem? Is there a general solution available?

mos · July 13, 2018, 9:36am

Does anyone have an idea? Are there any ES projects using decompounding?

xeraa · July 15, 2018, 12:23am

I'm not sure I fully understand the problem, but maybe you simply shouldn't do a phrase match on a decompounded field?

Let's make this a bit more concrete with a full example (mapping, analyze, data, and query):

DELETE /test

PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "hyphenation_decompounder": {
          "type": "hyphenation_decompounder",
          "hyphenation_patterns_path": "hyph/de.xml",
          "word_list": [
            "spiel",
            "bank",
            "gesellschaft"
          ]
        }
      },
      "analyzer": {
        "my_lowercase_analyzer": {
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        },
        "my_hyphenation_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "hyphenation_decompounder"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "quote": {
          "type": "text",
          "fields": {
            "lowercase": {
              "type": "text",
              "analyzer": "my_lowercase_analyzer"
            },
            "hyphenation": {
              "type": "text",
              "analyzer": "my_hyphenation_analyzer"
            }
          }
        }
      }
    }
  }
}

GET /test/_analyze
{
  "analyzer" : "my_hyphenation_analyzer",
  "text" : "deutsche Spielbankgesellschaft"
}

POST /test/_doc
{
  "quote": "deutsche Spielbankgesellschaft"
}

GET /test/_search
{
  "query": {
    "match": {
      "quote.lowercase": "deutsche bank"
    }
  }
}

GET /test/_search
{
  "query": {
    "match_phrase": {
      "quote.lowercase": "deutsche bank"
    }
  }
}

GET /test/_search
{
  "query": {
    "match": {
      "quote.hyphenation": "deutsche bank"
    }
  }
}

GET /test/_search
{
  "query": {
    "match_phrase": {
      "quote.hyphenation": "deutsche bank"
    }
  }
}

The .lowercase field finds the document with the match query but not with the match_phrase query. The .hyphenation field finds it with both.

Trying another example just looking for "bank":

GET /test/_search
{
  "query": {
    "match": {
      "quote.lowercase": "bank"
    }
  }
}

GET /test/_search
{
  "query": {
    "match": {
      "quote.hyphenation": "bank"
    }
  }
}

Only the .hyphenation field finds the document. So you probably need a combination depending on what people are searching for.

This might be a total oversimplification, but maybe you want to implement something like this. If the user searches for:

A single token:
- Search on the .hyphenation field with a match.
- Or maybe do a boolean query with should and a match on both .hyphenation and .lowercase — that would give "deutsche bank" more relevancy than "deutsche spielbankgesellschaft" when searching for "bank".
Multiple tokens:
- Search on the .lowercase field with a match_phrase.
- Optionally add a suggestion if that might be too strict.
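
A rough sketch of that decision logic in Python, assuming the `quote` field with the `.hyphenation` and `.lowercase` subfields from the mapping above. The function name `build_query` and the naive whitespace split used to count tokens are assumptions for illustration only; the returned dict is what you would send as the search request body.

```python
import json


def build_query(user_input, base_field="quote"):
    """Pick a query shape based on how many tokens the user typed.

    Hypothetical helper: tokens are counted with a plain whitespace split,
    which is cruder than what the analyzers do.
    """
    terms = user_input.split()
    if len(terms) <= 1:
        # Single token: match on both subfields. A document that matches
        # on .lowercase as well gets an additional score contribution.
        return {
            "query": {
                "bool": {
                    "should": [
                        {"match": {f"{base_field}.hyphenation": user_input}},
                        {"match": {f"{base_field}.lowercase": user_input}},
                    ]
                }
            }
        }
    # Multiple tokens: phrase match on the field without decompounding.
    return {"query": {"match_phrase": {f"{base_field}.lowercase": user_input}}}


print(json.dumps(build_query("bank"), indent=2))
print(json.dumps(build_query("deutsche bank"), indent=2))
```

With the single test document above, the query built for "deutsche bank" should not return "deutsche Spielbankgesellschaft", since the .lowercase field contains no "bank" token.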

mos · July 16, 2018, 7:30am

Thanks a lot @xeraa for the detailed answer.

Yes, your solution of using two different fields for the same content might work in some cases. In our case it has some major drawbacks:

- The index size would increase significantly (we are talking about 800 million documents with large content such as articles and ebooks).
- Our application needs to use the query-string-query syntax, for example: 'Geld AND Boerse AND "Deutsche Bank"'. In such a query we can't easily extract the phrase part and match it against a different field, right?

xeraa · July 16, 2018, 12:35pm

Index size: Do you need hyphenation on all fields? Though I assume for best search results you'll want a combination of lowercase, lowercase + stop words + stemming (+ synonyms), hyphenation, and so on, at least for some fields.
Also, did you check what that actually means for storage? Since you are "only" adding more inverted indices, I'd be curious what you are using now and how much the increase would be.

Why query string? Couldn't your application rewrite the queries to a boolean query? I think that would give you some more flexibility and features. You have some implicit knowledge when you create the query (what should use hyphenation and what should not, and so on) and I think you'll need to make use of that to get better results.
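
As a rough idea of what such a rewrite could look like, here is a Python sketch that handles only a small subset of the query-string syntax: bare terms and double-quoted phrases joined by AND. Everything else (OR, NOT, parentheses, field prefixes, wildcards) is deliberately ignored. The function name `rewrite_and_query` and the field names from the test mapping above are assumptions for illustration.

```python
import json
import re

# A double-quoted phrase, or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'"([^"]+)"|(\S+)')


def rewrite_and_query(query_string, base_field="quote"):
    """Turn 'a AND b AND "c d"' into a bool query (hypothetical helper).

    Quoted phrases go to the field without decompounding, single terms
    go to the decompounded field.
    """
    must = []
    for phrase, word in TOKEN_RE.findall(query_string):
        if phrase:
            must.append({"match_phrase": {f"{base_field}.lowercase": phrase}})
        elif word != "AND":
            must.append({"match": {f"{base_field}.hyphenation": word}})
    return {"query": {"bool": {"must": must}}}


body = rewrite_and_query('Geld AND Boerse AND "Deutsche Bank"')
print(json.dumps(body, indent=2))
```

The point of the sketch is only that the application knows which parts of the input are phrases, so it can route them to a different field than the single terms.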

mos · July 16, 2018, 12:53pm

Thanks for your feedback. We will discuss your suggestion with our project team.
