Query returning false results when term exceeds ngram length

Paul_Davies · December 19, 2017, 9:27am

The requirement is to search partial phrases in a block of text. Most of the words will be standard length. I want to keep the max_gram value down to 10. But there may be the occasional id/code with more characters than that, and these show up if I type in a query where the first 10 characters match, but then the rest don't.

For example, here is the mapping:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}

and document:

POST my_index/doc/1
{
  "title": "Quick fox with id of ABCDEFGHIJKLMNOP" 
}

If I run the query:

POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "fox wi"
      }
    }
  }
}

It returns the document as expected. However, if I run this:

POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "ABCDEFGHIJxxx"
      }
    }
  }
}

It also returns the document, when it shouldn't. It will do this if the x's are after the 10th character, but not before it. How can I avoid this?

I am using version 5.

dadoonet · December 19, 2017, 10:37am

You can use a simple analyzer at search time instead of the default ngram one that you set.

A full example here: https://www.elastic.co/guide/en/elasticsearch/reference/6.1/search-analyzer.html
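
For illustration, a minimal sketch of how that suggestion could look for the title field from the question. It assumes the built-in standard analyzer as the search analyzer (the one used on the linked reference page) and the my_index index and doc type from the first post; the same field definition could instead go into the mappings when the index is created.

```console
# Assumption: "standard" is the search analyzer meant here; the reply does not name one
PUT my_index/_mapping/doc
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}
```

With this in place the query text is no longer cut into ngrams, so ABCDEFGHIJxxx is searched as the single term abcdefghijxxx, which is not among the indexed prefixes.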

Paul_Davies · December 19, 2017, 1:28pm

So you're saying the mappings part is like this?

"mappings": {
      "doc": {
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "autocomplete"
          }
        }
      }
    }

Unfortunately this doesn't work with match_phrase.

This query:

POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "quick fox"
      }
    }
  }
}

now returns no results.

Paul_Davies · December 19, 2017, 2:17pm

I seem to have found a solution to this. The solution was to change the default search_analyzer, but not in the way suggested by @dadoonet.

Here is the mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "autocomplete_search",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter", "digit"
          ]
        },
        "autocomplete_search": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 100,
          "token_chars": [
            "letter", "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}

I don't 100% understand what is going on here, but the search_analyzer defaults to being the same as the analyzer.

Please correct me if I'm wrong on any of this:

The analyzer is what is applied to the record being indexed at indexing time, and the search_analyzer is applied to the search term at query time.

With the original mapping, the same analyzer is applied to the search term, so the term is broken down into ngrams with a maximum length of 10.
Comparing ABCDEFGHIJxxx to ABCDEFGHIJKLMNOP comes out as a positive match because only the first ten characters are compared.
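
One way to check this is the _analyze API, which returns the tokens an analyzer produces for a given text. A sketch, assuming the index is named my_index as in the first post:

```console
# Show the tokens the 10-character edge_ngram analyzer produces for the query text
GET my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "ABCDEFGHIJxxx"
}
```

With max_gram set to 10, the longest token returned should be abcdefghij. The trailing xxx never becomes part of any token, and the same ten prefixes are produced for ABCDEFGHIJKLMNOP at index time.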

I didn't want to increase max_gram for the indexing analyzer too much because this can slow down both indexing and searching.

So I have increased max_gram for the search term, and it is now trying to match up to the first 100 characters (should be sufficient) to the 10-character ngrams in the index.
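
The effect on the query side can be inspected the same way. A sketch, assuming the index was created as my_index with the mapping above:

```console
# Show the tokens the 100-character search-time analyzer produces for the query text
GET my_index/_analyze
{
  "analyzer": "autocomplete_search",
  "text": "ABCDEFGHIJxxx"
}
```

Here the token list should continue past ten characters (abcdefghijx, abcdefghijxx, abcdefghijxxx). None of those exist in the index, where tokens stop at ten characters, so the phrase query no longer matches.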

This means that typing the exact code ABCDEFGHIJKLMNOP doesn't come up with a match, but this can be fixed by indexing another field with the standard analyzer and doing a multi_match query on both fields.
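
A sketch of that idea, not something tested in this thread. It assumes a sub-field named exact (a hypothetical name) analyzed with the standard analyzer, and a multi_match query of type phrase across both fields. Documents indexed before the sub-field was added would need to be reindexed.

```console
# Add a standard-analyzed sub-field; "exact" is a hypothetical name
PUT my_index/_mapping/doc
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "autocomplete",
      "search_analyzer": "autocomplete_search",
      "fields": {
        "exact": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}

# Phrase search on the ngram field and on the whole-word sub-field
POST my_index/doc/_search
{
  "query": {
    "multi_match": {
      "query": "ABCDEFGHIJKLMNOP",
      "type": "phrase",
      "fields": ["title", "title.exact"]
    }
  }
}
```

The full code should then match through title.exact, while partial input such as "fox wi" still matches through title.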

dadoonet · December 19, 2017, 5:47pm

autocomplete and autocomplete_search are doing the same thing.
So IMO, this:

        "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "autocomplete_search"
        }

is the same as:

        "title": {
          "type": "text",
          "analyzer": "autocomplete"
        }

Paul_Davies · December 19, 2017, 6:54pm

They're not the same: the autocomplete analyzer uses the autocomplete tokenizer, and the autocomplete_search analyzer uses the autocomplete_search tokenizer (it may have been an idea to use different names for the analyzers and tokenizers). The autocomplete and autocomplete_search tokenizers have max_gram set to 10 and 100 respectively, which is how I got it to work.

system · January 16, 2018, 6:54pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
