Ngrams for compound words in ES 2.*

tallakh · April 3, 2016, 11:16am

I've been looking into using ngrams to handle misspelled search terms and complex compound words, both in the index and in the query string.

I came across ngrams-compound-words, which really fits my needs, but the example query does not work in ES 2.*. The "minimum_should_match" setting no longer seems to behave as it does in the example.

Is it possible to achieve the same functionality in ES 2.*?

The example query looks like this:

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "text": {
                "query":                "Gesundheit",
                "minimum_should_match": "80%" //this does not work in ES 2.*
            }
        }
    }
}

warkolm · April 3, 2016, 11:31pm

Can you elaborate more on the "doesn't work" part, what are you seeing?

tallakh · April 4, 2016, 8:21am

Changing the "minimum_should_match" has no effect on the search result in 2.*. In my understanding, a higher percentage should only return hits where more of the ngrams from the query match an ngram in the document, which should give higher precision.
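To make the expected behaviour concrete, here is a rough sketch of the arithmetic, assuming trigrams and assuming that a positive percentage is rounded down to a whole number of clauses (as the minimum_should_match documentation describes). The function name is made up for illustration.

```python
def required_matches(optional_clauses, percent):
    """Clauses that must match for a positive percentage (hypothetical helper)."""
    # Assumption: the computed value is rounded down.
    return (optional_clauses * percent) // 100

query = "gesundheit"
trigram_count = len(query) - 3 + 1  # a 10-letter word yields 8 trigrams

print(required_matches(trigram_count, 80))  # 6
```

So with "80%" I would expect a document to need at least 6 of the 8 trigrams, which still leaves room for a spelling mistake.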

So far I've been able to replicate the functionality by splitting the query into ngrams myself and constructing a boolean query, but that complicates this feature a lot.

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "minimum_should_match": "80%",
      "should": [
        {
          "term": {
            "text": "ges"
          }
        },
        {
          "term": {
            "text": "esu"
          }
        },
        {
          "term": {
            "text": "sun"
          }
        },
        {
          "term": {
            "text": "und"
          }
        },
        {
          "term": {
            "text": "ndh"
          }
        },
        {
          "term": {
            "text": "dhe"
          }
        },
        {
          "term": {
            "text": "hei"
          }
        },
        {
          "term": {
            "text": "eit"
          }
        }
      ]
    }
  }
}
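For reference, the request body above does not have to be written by hand. This is only a minimal sketch of how it could be generated client-side, assuming the field is analyzed with lowercased trigrams and the input is a single word. The helper names `trigrams` and `ngram_bool_query` are hypothetical, not part of any Elasticsearch client.

```python
import json

def trigrams(word, n=3):
    """Split a single word into lowercased character ngrams of length n."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def ngram_bool_query(field, text, minimum_should_match="80%"):
    """Build a bool query with one term clause per ngram (hypothetical helper)."""
    should = [{"term": {field: gram}} for gram in trigrams(text)]
    return {
        "query": {
            "bool": {
                "minimum_should_match": minimum_should_match,
                "should": should,
            }
        }
    }

body = ngram_bool_query("text", "Gesundheit")
print(json.dumps(body, indent=2))
```

The drawback is the one mentioned above: the client now has to duplicate the analysis that the index analyzer already performs.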

tallakh · April 6, 2016, 6:52am

Can anyone confirm that this is the only (or best) way to achieve this functionality in ES 2.*?

mikemccand · April 6, 2016, 10:23am

Hmm, the minimum_should_match set on the query string only applies if the query parsed to a boolean query and coord was not disabled. Here's the comment on top of this logic, in ES master QueryStringBuilder.java:

        // If the coordination factor is disabled on a boolean query we don't apply the minimum should match.
        // This is done to make sure that the minimum_should_match doesn't get applied when there is only one word
        // and multiple variations of the same word in the query (synonyms for instance).
        if (query instanceof BooleanQuery && !((BooleanQuery) query).isCoordDisabled()) {
            query = Queries.applyMinimumShouldMatch((BooleanQuery) query, this.minimumShouldMatch());
        }

We need to see exactly what query class ES created on parsing your query string with your ngram tokenizer...
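One way to check this might be the validate API with `explain`, which reports the rewritten Lucene query so its structure can usually be inferred. The sketch below only assembles the request (it does not send it). The index, type and field names are taken from the example earlier in the thread, and the helper name is hypothetical.

```python
import json

def validate_explain_request(index, doc_type, query):
    """Return (method, path, body) for a validate-with-explain call (hypothetical helper)."""
    path = "/{}/{}/_validate/query?explain=true".format(index, doc_type)
    return "GET", path, json.dumps({"query": query})

method, path, body = validate_explain_request(
    "my_index",
    "my_type",
    {"match": {"text": {"query": "Gesundheit", "minimum_should_match": "80%"}}},
)
print(method, path)
print(body)
```

The explanation in the response should show whether the ngrams ended up as separate optional clauses or as something else.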
