Query returning false results when term exceeds ngram length


(Paul Davies) #1

The requirement is to search partial phrases in a block of text. Most of the words will be standard length. I want to keep the max_gram value down to 10. But there may be the occasional id/code with more characters than that, and these show up if I type in a query where the first 10 characters match, but then the rest don't.

For example, here is the mapping:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}

and document:

POST my_index/doc/1
{
  "title": "Quick fox with id of ABCDEFGHIJKLMNOP"
}

If I run the query:

POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "fox wi"
      }
    }
  }
}

It returns the document as expected. However, if I run this:

POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "ABCDEFGHIJxxx"
      }
    }
  }
}

It also returns the document, when it shouldn't. It will do this if the x's are after the 10th character, but not before it. How can I avoid this?

I am using version 5.


(David Pilato) #2

You can use a simple analyzer at search time instead of the default ngram one that you set.

A full example here: https://www.elastic.co/guide/en/elasticsearch/reference/6.1/search-analyzer.html


(Paul Davies) #3

So you're saying the mappings part is like this?

"mappings": {
      "doc": {
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "autocomplete"
          }
        }
      }
    }

Unfortunately this doesn't work with match_phrase.

This query:

POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "quick fox"
      }
    }
  }
}

now returns no results.


(Paul Davies) #4

I seem to have found a solution to this. The solution was to change the default search_analyzer, but not in the way suggested by @dadoonet.

Here is the mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "autocomplete_search",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter", "digit"
          ]
        },
        "autocomplete_search": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 100,
          "token_chars": [
            "letter", "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}

I don't 100% understand what is going on here, but the search_analyzer defaults to the same analyzer that is set in the analyzer field.

Please correct me if I'm wrong on any of this:

The analyzer is applied to the document at indexing time, and the search_analyzer is applied to the search term at query time.

With the original mapping, the search term was also broken down into edge ngrams with a maximum length of 10.
Comparing ABCDEFGHIJxxx to ABCDEFGHIJKLMNOP came out as a positive match because only the first ten characters were ever compared.
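
One way to check this is the _analyze API, which lists the tokens an analyzer produces for a piece of text. A minimal sketch, assuming the index is still called my_index and using the autocomplete analyzer (max_gram of 10) defined in the mapping:

```
POST my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "ABCDEFGHIJxxx"
}
```

The longest token in the response should be abcdefghij, so the trailing xxx never takes part in the comparison.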

I didn't want to increase max_gram for the indexing analyzer too much because this can slow down both indexing and searching.

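So I have increased max_gram for the search analyzer only. The search term is now broken into ngrams of up to 100 characters (should be sufficient), and any ngram longer than 10 characters has no counterpart among the 10-character ngrams in the index, so the phrase no longer matches.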
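
The effect on the search side can be inspected in the same way. This is only a sketch, assuming the autocomplete_search analyzer from the mapping in this post and an index named my_index:

```
POST my_index/_analyze
{
  "analyzer": "autocomplete_search",
  "text": "ABCDEFGHIJxxx"
}
```

The response should now include tokens longer than 10 characters, up to abcdefghijxxx. The index only holds ngrams of up to 10 characters, so those longer tokens cannot be found there.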

This means that typing the exact code ABCDEFGHIJKLMNOP doesn't produce a match either, but this can be fixed by indexing the text into another field with the standard analyzer and running a multi_match query on both fields.
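
A rough sketch of that fallback, which I have not tested against this index. The sub-field name exact is just a placeholder, and documents that are already indexed would need to be reindexed before title.exact contains anything:

```
# Add a sub-field analyzed with the standard analyzer (hypothetical name: exact)
PUT my_index/_mapping/doc
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "autocomplete",
      "search_analyzer": "autocomplete_search",
      "fields": {
        "exact": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}

# Phrase search across the ngram field and the standard-analyzed sub-field
POST my_index/doc/_search
{
  "query": {
    "multi_match": {
      "query": "ABCDEFGHIJKLMNOP",
      "type": "phrase",
      "fields": ["title", "title.exact"]
    }
  }
}
```

The idea is that the full code matches through title.exact, while partial phrases such as "fox wi" still match through title.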


(David Pilato) #5

autocomplete and autocomplete_search are doing the same thing.
So IMO, this:

        "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "autocomplete_search"
        }

is the same as:

        "title": {
          "type": "text",
          "analyzer": "autocomplete"
        }

(Paul Davies) #6

They're not the same: the autocomplete analyzer uses the autocomplete tokenizer, and the autocomplete_search analyzer uses the autocomplete_search tokenizer (it may have been an idea to use different names for the analyzers and tokenizers). The autocomplete and autocomplete_search tokenizers have max_gram set to 10 and 100 respectively, which is how I got it to work.

