Exact Sub-String Match | Elasticsearch

We are migrating our search strategy from the database to Elasticsearch. As part of this, we need to preserve the existing functionality of partially matching a field, similar to the SQL query below (including whitespace):

SELECT *
FROM customer
WHERE customer_id LIKE '%0995%';

With that said, I've gone through multiple articles on Elasticsearch and how to achieve this functionality. Here is what I've come up with:

  1. The majority of the articles I read recommend using an ngram analyzer/filter, so this is what my mapping and settings look like:

Note: the customer_id column is VARCHAR2(100), i.e. at most 100 characters.

{
   "customer-index":{
      "aliases":{
      },
      "mappings":{
         "customer":{
            "properties":{
               "customerName":{
                  "type":"text",
                  "fields":{
                     "keyword":{
                        "type":"keyword",
                        "ignore_above":256
                     }
                  }
               },
               "customerId":{
                  "type":"text",
                  "fields":{
                     "keyword":{
                        "type":"keyword",
                        "ignore_above":256
                     }
                  },
                  "analyzer":"substring_analyzer"
               }
            }
         }
      },
      "settings":{
         "index":{
            "number_of_shards":"3",
            "provided_name":"customer-index",
            "creation_date":"1573333835055",
            "analysis":{
               "filter":{
                  "substring":{
                     "type":"ngram",
                     "min_gram":"3",
                     "max_gram":"100"
                  }
               },
               "analyzer":{
                  "substring_analyzer":{
                     "filter":[
                        "lowercase",
                        "substring"
                     ],
                     "type":"custom",
                     "tokenizer":"standard"
                  }
               }
            },
            "number_of_replicas":"1",
            "uuid":"XXXXXXXXXXXXXXXXX",
            "version":{
               "created":"5061699"
            }
         }
      }
   }
}
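Note that settings such as provided_name, creation_date, uuid, and version in the dump above are read-only metadata added by Elasticsearch; they are not part of the index-creation request. As a sanity check of the analyzer itself, the _analyze API shows which tokens substring_analyzer emits at index time, e.g. for one of the IDs discussed below:

GET customer-index/_analyze
{
  "analyzer": "substring_analyzer",
  "text": "0009950011214"
}

For this 13-character ID it returns every substring of length 3 through 13 (66 tokens in total), all of which end up in the index.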

The request to query the data looks like this:

{
  "from": 0,
  "size": 10,
  "sort": [
    {
      "name.keyword": {
        "missing": "_first",
        "order": "asc"
      }
    }
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "query": "0995",
            "fields": [
              "customer_id"
            ],
            "analyzer": "substring_analyzer"
          }
        }
      ]
    }
  }
}

With that in place, here are a couple of questions/issues:

  1. Let's say there are three records with these customer_id values:

0009950011214, 0009900011214, 0009920011214

When I search for "0995", ideally I expect to get back only customer_id 0009950011214.

But I get all three records in the result set, and I believe it's due to the ngram analyzer and the way it splits the string (min_gram: 3 and max_gram: 100; setting max_gram to 100 was meant to allow exact matches). See the _analyze check after this list.

How should I fix this?

  2. This brings me to my second point. Is using an ngram analyzer the most effective strategy for this kind of requirement? My concern is the memory utilization of having min_gram = 3 and max_gram = 100 (a rough estimate follows below). Is there a better way to implement the same?
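On the first issue: running the search term through the same analyzer shows why all three documents match:

GET customer-index/_analyze
{
  "analyzer": "substring_analyzer",
  "text": "0995"
}

This returns the tokens 099, 995, and 0995. The 3-gram 099 occurs in all three IDs above, so all three documents match (in a scored query the first ID would typically rank higher, but the bool filter here bypasses scoring entirely).

On the second issue, a back-of-the-envelope estimate: a single 100-character ID analyzed with min_gram = 3 and max_gram = 100 produces (100 - 3 + 1) + (100 - 4 + 1) + ... + (100 - 100 + 1) = 98 + 97 + ... + 1 = 4,851 ngrams, so the index grows substantially compared to storing one term per ID.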

P.S.: I'm on NEST 5.5.

Running wildcard queries against a keyword field could give you the result you are looking for, but a leading wildcard is the most inefficient query you can run, as all terms need to be scanned.
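For reference, such a query would look something like this, using the keyword sub-field from the mapping above (shown only to illustrate the cost, not as a recommendation):

GET customer-index/_search
{
  "query": {
    "wildcard": {
      "customerId.keyword": "*0995*"
    }
  }
}

Because the pattern starts with *, Elasticsearch cannot use the term index to narrow down candidates and has to walk every unique value of customerId.keyword.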

Ngrams should be faster. In your case you are getting matches on the 3-character ngrams, but you could get around this by using a different search-time analyzer that does not break your search term up. That way a 4-character search string would only match 4-character ngrams, and so on.
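A minimal sketch of what that could look like in the mapping, assuming the index-time substring_analyzer defined earlier stays as-is (search_analyzer is the standard mapping parameter for this):

"customerId": {
  "type": "text",
  "analyzer": "substring_analyzer",
  "search_analyzer": "standard",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}

With this, the search term is kept as a single token at query time, so "0995" only matches documents that contain the full 4-character ngram "0995". Note that an explicit "analyzer" in a query_string clause overrides search_analyzer, so the override in the original request would have to be removed.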

Thank you, Christian, for your input. Yes, I agree that a wildcard query won't make the cut in my scenario. Do you have any recommendation on how we can have a search_analyzer that only matches ngrams of the same length as the search string?

Index with ngrams and use an analyzer that does not break down the search term. A standard or whitespace analyzer might work at search time.
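Following that advice, one way to adjust the original request without changing the mapping is to override the analyzer in the query itself, e.g.:

GET customer-index/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "query": "0995",
            "fields": ["customerId"],
            "analyzer": "standard"
          }
        }
      ]
    }
  }
}

The indexed ngrams stay as they are, but the search term is no longer split into 3-grams at query time, so of the three IDs above only 0009950011214 matches.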
