NGram search troubles (Is it possible to match the entire string?)

I am almost positive that this is a simple misunderstanding on my part, as I'm very new to Elasticsearch. I am trying to implement substring matching using ngrams, with index.max_ngram_diff set to 10. My setup looks something like:

  "filter": {
      "barcode_filter": {
        "type": "nGram",
        "min_gram": "4",
        "max_gram": "14"
      }
    },
    "analyzer": {
      "barcode_filter_analyzer": {
        "filter": [
          "lowercase",
          "barcode_filter"
        ],
        "type": "custom",
        "tokenizer": "standard"
      }
    }

My field is defined as barcode with the analyzer set to barcode_filter_analyzer.

My goal is to be able to find substrings. A query like "V741" should find every barcode containing that substring, and "ZR000041" should likewise find every barcode containing that entire substring. The second query does find every barcode containing the full string, but it also finds every barcode containing "0000", for example. My initial thought was to rely on the score, but that doesn't seem to work either: a barcode like 00000000000101 ends up with a very high score.

Is there any good way to require that the entire query string be found? This is currently only in testing on my PC, so anything requiring scrapping and rebuilding the index is not a big deal; if I'm approaching this in entirely the wrong way, I'm happy to adjust. Any help at all would be greatly appreciated!

hey,

so the reason for this behaviour is that by configuring an ngram filter in your mapping, that filter is also applied when querying data. This means a string like ZR000041 is split into ngrams on the query side as well, and each of those grams is matched independently.

See this example:

DELETE test

PUT test
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "barcode_filter_analyzer",
        "fields": {
          "keyword" : {
            "type" : "keyword"
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "barcode_filter": {
          "type": "nGram",
          "min_gram": "4",
          "max_gram": "5"
        }
      },
      "analyzer": {
        "barcode_filter_analyzer": {
          "filter": [
            "lowercase",
            "barcode_filter"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}

GET test/_analyze
{
  "text": "ZR000041",
  "analyzer": "barcode_filter_analyzer"
}

PUT test/_doc/2
{
  "my_field": "ZR000042"
}


PUT test/_doc/1?refresh=true
{
  "my_field": "ZR000041"
}

# finds only one product
GET test/_search 
{
  "query": {
    "match": {
      "my_field": {
        "query": "0041"
      }
    }
  }
}

# finds both products
GET test/_search 
{
  "query": {
    "match": {
      "my_field": {
        "query": "ZR000041",
        "operator": "and"
      }
    }
  }
}

# favor exact match
GET test/_search 
{
  "query": {
    "multi_match": {
      "query": "ZR000041",
      "fields": [ "my_field", "my_field.keyword^2" ]
    }
  }
}

One last thing: in order to understand scoring, use the "explain": true parameter in your search, as this will show you the final Lucene query and how each document's score was computed.

GET test/_search 
{
  "explain": true, 
  "query": {
    "match": {
      "my_field": {
        "query": "ZR000041"
      }
    }
  }
}

Hope this helps...

--Alex

To add to Alex's explanation here, one solution could be to provide a search analyzer that does not apply the ngram filter to the query terms. You could do that by configuring a search_analyzer in your mapping, or by providing an analyzer directly in your query:

GET test/_search 
{
  "query": {
    "match": {
      "my_field": {
        "query": "ZR000041",
        "analyzer": "standard"
      }
    }
  }
}
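
The mapping-based alternative would look something like the sketch below (a hypothetical variation of the test index from Alex's example; the analysis settings are assumed to be the same as above):

PUT test
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "barcode_filter_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

With this, documents are still indexed as ngrams, but query strings are only tokenized by the standard analyzer, so a match query with "operator": "and" requires the whole query string to be present as one indexed gram (as long as it fits within your min_gram/max_gram range).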

Thanks Alex, I figured that was what was happening but was having trouble wording it! Thanks for the tip on the explain, I'd not seen that before!

Abdon, this seems to do exactly what I wanted. Thank you!