Exact Sub-String Match | ElasticSearch

Soberwolf · November 11, 2019, 6:13am

We are migrating our search strategy, from database to ElasticSearch. During this we are in need to preserve the existing functionality of partially searching a field similar the SQL query below (including whitespaces):

SELECT *
FROM customer
WHERE customer_id LIKE '%0995%';

Having said that, I've gone through multiple articles related to ES and achieving the said functionality. After the above exercise following is what I've come up with:

Majority of the article which I read recommended to use nGram analyzer/filter; hence following is how mapping & setting looks like:

Note: The max length of customer_id field is VARCHAR2(100).

{
   "customer-index":{
      "aliases":{
      },
      "mappings":{
         "customer":{
            "properties":{
               "customerName":{
                  "type":"text",
                  "fields":{
                     "keyword":{
                        "type":"keyword",
                        "ignore_above":256
                     }
                  }
               },
               "customerId":{
                  "type":"text",
                  "fields":{
                     "keyword":{
                        "type":"keyword",
                        "ignore_above":256
                     }
                  },
                  "analyzer":"substring_analyzer"
               }
            }
         }
      },
      "settings":{
         "index":{
            "number_of_shards":"3",
            "provided_name":"customer-index",
            "creation_date":"1573333835055",
            "analysis":{
               "filter":{
                  "substring":{
                     "type":"ngram",
                     "min_gram":"3",
                     "max_gram":"100"
                  }
               },
               "analyzer":{
                  "substring_analyzer":{
                     "filter":[
                        "lowercase",
                        "substring"
                     ],
                     "type":"custom",
                     "tokenizer":"standard"
                  }
               }
            },
            "number_of_replicas":"1",
            "uuid":"XXXXXXXXXXXXXXXXX",
            "version":{
               "created":"5061699"
            }
         }
      }
   }
}

Request to query the data looks like this:

{
  "from": 0,
  "size": 10,
  "sort": [
    {
      "name.keyword": {
        "missing": "_first",
        "order": "asc"
      }
    }
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "query": "0995",
            "fields": [
              "customer_id"
            ],
            "analyzer": "substring_analyzer"
          }
        }
      ]
    }
  }
}

With that being said, here are couple of queries/issue:

Lets say there are 3 records with customer_id:

0009950011214, 0009900011214, 0009920011214

When I search for "0995". Ideally, I am looking forward to get only customer_id: 0009950011214.

But I get all three records as part of result set and I believe its due to nGram analyzer and the way it splits the string (note: minGram: 3 and maxGram:100). Setting maxGram to 100 was for exact match.

How should I fix this?

This brings me to my second point. Is using nGram analyzer for this kind of requirement the most effective strategy? My concern is the memory utilization of having minGram = 3 and maxGram = 100. Is there are better way to implement the same?

P.S: I'm on NEST 5.5.

Christian_Dahlqvist · November 11, 2019, 6:34am

Running wildcard queries against a keyword field could give you the result you are looking for, but having a leading wildcard is the most inefficient query you can run as all terms need to be scanned.

Ngrams should be faster. In your case you are getting matches on the 3 character ngrams, but you could get around this by using a different search-time analyzer that does not break your search term up. That way a 4 character string would only match 4 character ngrams etc.

Soberwolf · November 11, 2019, 7:01am

Thank you Christian, for your input. Yeah I agree that wild card query won't make the cut in my scenario. Do you have any recommendation on how we can have a search_analyzer that matches the ngrams with length = search string?

Christian_Dahlqvist · November 11, 2019, 9:05am

Index with ngrams and use an analyser that does not break down the search term. A standard or white space analyser might work at search time.

system · December 9, 2019, 9:05am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Better effective substring query idea? Elasticsearch	13	1520	July 6, 2017
NGram search troubles(Is it possible to match entire string) Elasticsearch	5	3505	September 4, 2019
Tokens outside the ngram size Elasticsearch	2	277	July 6, 2017
What search to use? Elasticsearch	7	773	July 5, 2017
Phrase matching using query_string on nGram analyzed data Elasticsearch	4	1615	July 6, 2017

Exact Sub-String Match | ElasticSearch

Related topics