NGram search troubles (is it possible to match the entire string?)

I am almost positive that this is a simple misunderstanding on my part, as I'm very new to Elasticsearch. I am trying to implement substring matching using ngrams, and I have max_ngram_diff set to 10. My setup looks something like:

  "filter": {
      "barcode_filter": {
        "type": "nGram",
        "min_gram": "4",
        "max_gram": "14"
      }
    },
    "analyzer": {
      "barcode_filter_analyzer": {
        "filter": [
          "lowercase",
          "barcode_filter"
        ],
        "type": "custom",
        "tokenizer": "standard"
      }
    }

My field barcode is mapped with the analyzer set to barcode_filter_analyzer.
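In the mapping that looks something like this (the index name is just a placeholder):

PUT barcodes
{
  "mappings": {
    "properties": {
      "barcode": {
        "type": "text",
        "analyzer": "barcode_filter_analyzer"
      }
    }
  }
}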

My goal is to be able to find substrings. Something like "V741" should find every barcode with that substring, and something like "ZR000041" should likewise find every barcode containing that entire substring. The second example does find every barcode containing the entire substring, but it also finds every barcode containing, for example, just "0000". My initial thought was to use the score to rank exact matches first, but that doesn't seem to work: a barcode like 00000000000101 ends up with a very high score.

Is there any good way to require that the entire query string be found? This is currently only in testing on my PC, so anything that requires scrapping and rebuilding the index is not a big deal, and if I'm approaching this in entirely the wrong way I'm happy to adjust. Any help at all would be greatly appreciated!

hey,

so the reason for this behaviour is that, by configuring an ngram filter in your mapping, this filter will also be used when querying data. This means a string like ZR000041 will be split into ngrams on the query side as well.

See this example:

DELETE test

PUT test
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "barcode_filter_analyzer",
        "fields": {
          "keyword" : {
            "type" : "keyword"
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "barcode_filter": {
          "type": "nGram",
          "min_gram": "4",
          "max_gram": "5"
        }
      },
      "analyzer": {
        "barcode_filter_analyzer": {
          "filter": [
            "lowercase",
            "barcode_filter"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}

GET test/_analyze
{
  "text": "ZR000041",
  "analyzer": "barcode_filter_analyzer"
}
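With min_gram 4 and max_gram 5, this _analyze call should return every gram of the lowercased input, all at the same token position, something like:

zr00, zr000, r000, r0000, 0000, 00004, 0004, 00041, 0041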

PUT test/_doc/2
{
  "my_field": "ZR000042"
}


PUT test/_doc/1?refresh=true
{
  "my_field": "ZR000041"
}

# finds only one product: "0041" is a single 4-character gram,
# and only ZR000041 contains it
GET test/_search 
{
  "query": {
    "match": {
      "my_field": {
        "query": "0041"
      }
    }
  }
}

# finds both products, even with "operator": "and": the ngram token filter
# emits all grams at the same position, so the match query treats them as
# synonyms rather than separate required terms
GET test/_search 
{
  "query": {
    "match": {
      "my_field": {
        "query": "ZR000041",
        "operator": "and"
      }
    }
  }
}

# favor the exact match by boosting the unanalyzed keyword sub-field
GET test/_search 
{
  "query": {
    "multi_match": {
      "query": "ZR000041",
      "fields": [ "my_field", "my_field.keyword^2" ]
    }
  }
}

One last tip for understanding scoring: use the explain: true parameter in your search, as it will show you the final Lucene query.

GET test/_search 
{
  "explain": true, 
  "query": {
    "match": {
      "my_field": {
        "query": "ZR000041"
      }
    }
  }
}
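For this query the explanation should show (abridged) that all the grams end up in a single synonym clause at one position, something like weight(Synonym(my_field:0000 my_field:00004 my_field:0004 ...)). That is also why the and operator in the earlier example made no difference.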

Hope this helps...

--Alex

To add to Alex's explanation here: one solution could be to provide a search analyzer that does not apply the ngram filter to the query terms. You could do that by configuring a search_analyzer in your mapping (sketch below), or by providing an analyzer in your query:

GET test/_search 
{
  "query": {
    "match": {
      "my_field": {
        "query": "ZR000041",
        "analyzer": "standard"
      }
    }
  }
}
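For the mapping variant it would look something like this (reusing the test index from above):

DELETE test

PUT test
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "barcode_filter_analyzer",
        "search_analyzer": "standard"
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "barcode_filter": {
          "type": "nGram",
          "min_gram": "4",
          "max_gram": "5"
        }
      },
      "analyzer": {
        "barcode_filter_analyzer": {
          "filter": [
            "lowercase",
            "barcode_filter"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}

One caveat: with this approach the whole query string has to match a single indexed gram, so it has to be between min_gram and max_gram characters long (4 to 14 in your config). The test index here uses max_gram 5, so the eight-character ZR000041 would not match it; with your min_gram 4 / max_gram 14 setup it will work for substrings of up to 14 characters.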

Thanks Alex, I figured that was what was happening but was having trouble putting it into words! Thanks for the tip on the explain parameter, I'd not seen that before!

Abdon, this seems to do exactly what I wanted. Thank you!
