Guidance on partial keyword search

Hi,

What is the most effective approach for implementing partial keyword search over lengthy indexed texts (exceeding 5000 characters)? For instance, if a document contains the keyword "Exclusive", searching with "clus" should yield a match.

The ngram tokenizer (and, for prefix-only matching, edge_ngram) can meet this requirement, but setting min_gram to 1 and max_gram to 20 puts significant strain on the server when the text exceeds 5000 characters. Even tightening min_gram and max_gram to, say, 2 and 10 still produces an abundance of tokens because of the sheer character count, which is a challenge in terms of server load and efficiency.
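
To illustrate the blow-up, here is a minimal sketch using the _analyze API (the gram sizes match the "optimized" 2/10 values above, and the sample text is just the example keyword):

POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 10
  },
  "text": "Exclusive"
}

This single 9-character word already yields 36 grams (including "clus"); with a 5000+ character field the same settings produce tens of thousands of tokens per document.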

Thanks

Hi Sandip and welcome to the community,

You're right that using n-grams has a cost, as it explodes the number of tokens being generated. There are some ways to mitigate this:

  • Narrow the gram range as far as the requirements allow; min_gram 1 and max_gram 20 is far more than most use cases need.
  • Avoid falling back to wildcard queries (e.g. *clus*): they skip the indexing cost, but they are not recommended for partial search since they scan terms at query time and get slow on large indices.
  • Consider the search_as_you_type field type instead of a hand-rolled ngram setup (see the sketch below).
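
For the last point, here is a minimal sketch of what that could look like (the index and field names are hypothetical):

PUT sayt_test
{
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type"
      }
    }
  }
}

POST sayt_test/_search
{
  "query": {
    "multi_match": {
      "query": "exclu",
      "type": "bool_prefix",
      "fields": ["title", "title._2gram", "title._3gram"]
    }
  }
}

Keep in mind that search_as_you_type is built on shingles and edge n-grams, so it targets prefix matching ("exclu" would match "Exclusive") rather than matching in the middle of a word ("clus" would not).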

However, I'd like to take a step back and ask some more details about your use case. What are the requirements for the partial search (on what type of field, what information does the field contain, are you powering a search bar or highlighting etc)? Do you want to match even on 1 character (e.g. any document that contains "a" in that field)? And also match a few characters anywhere within a 5000+ character text?

Hi Adam,

Thank you for your prompt response.

Here are my follow-up comments based on your feedback:

  • I acknowledge the need to reevaluate the current ngram configuration (min_gram/max_gram).
  • As you rightly pointed out, the wildcard query is not recommended for partial search, and we are aligned on avoiding it.
  • I'll explore your suggestion of the search-as-you-type feature as a possible alternative.

Regarding the use case:

The index contains approximately 800K documents. The existing search-as-you-type functionality is implemented with ngram: 5 fields are consolidated into a single combined field for searching, and the ngram analyzer is applied to that combined field. The 'description' field is the tricky one, since it usually contains a substantial amount of text, whereas the other fields together generally stay under 400 characters. My objective is to keep the search-as-you-type functionality covering all of these fields. All of them contain letters, numbers, spaces, and sometimes special characters. Highlighting is not currently in use. The primary goal is simply to show the relevant documents in the search results when the user's keywords partially match.
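
To illustrate, the current setup is conceptually along these lines (simplified: only two of the five source fields are shown, the names are made up, and copy_to stands in for how the fields are actually consolidated):

PUT combined_ngram_test
{
  "settings": {
    "index.max_ngram_diff": 19, # min_gram and max_gram differ by 19, above the default limit of 1
    "analysis": {
      "tokenizer": {
        "combined_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "combined_ngram_analyzer": {
          "tokenizer": "combined_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title":       { "type": "text", "copy_to": "combined" },
      "description": { "type": "text", "copy_to": "combined" },
      "combined": {
        "type": "text",
        "analyzer": "combined_ngram_analyzer" # The ngram analyzer only runs on the consolidated field
      }
    }
  }
}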

I hope this gives a clear picture of the scenario.

Thanks for providing some more details. I'm afraid I can't think of any better solution off the top of my head than reevaluating the requirements and fine-tuning your analyzer and index mapping accordingly.

For example, could you loosen the requirement for matching 1-2 characters or matching >10 chars? If the field is full text (and so it contains words separated by spaces and punctuation), I find it unlikely that someone would want to type in 10+ chars and still continue typing to narrow down search results.

Alternatively, you can define a mapping where description is analyzed into grams of 3-10 chars and the other 4 fields into grams of 1-10 chars, then combine the search with a bool query, something like the example below. This supports hybrid matching while limiting memory usage. Note that these tokenizers only build grams from letters and digits (via token_chars), so spaces, punctuation, and other special characters are stripped and the text is broken up word by word.

PUT ngram_test
{
  "settings": {
    "index.max_ngram_diff": 10, # Allow large difference between min/max_ngram
    "analysis": {
      "tokenizer": {
        "title_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 10
        },
        "description_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "title_analyzer": {
          "filter" : ["lowercase", "asciifolding"],
          "tokenizer": "title_ngram_tokenizer"
        },
        "description_analyzer": {
          "filter" : ["lowercase", "asciifolding"],
          "tokenizer": "description_ngram_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "title_analyzer"
          }
        }
      },
      "description": {
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "description_analyzer"
          }
        }
      },
      "status": {
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "title_analyzer"
          }
        }
      }
    }
  }
}

Query:

POST ngram_test/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title.ngram": {
              "query": "abc",
              "operator": "AND" # All ngrams of the query should match ngrams of the field
            }
          }
        },
        {
          "match": {
            "description.ngram": {
              "query": "abc",
              "operator": "AND"
            }
          }
        },
        {
          "match": {
            "status.ngram": {
              "query": "abc",
              "operator": "AND"
            }
          }
        }
      ], # Finds 3-10 char fragments in all fields; finds 1-2 char fragments in fields other than description
      "minimum_should_match": 1
    }
  }
}

In any case I think the key to improving performance is ensuring that either the number of ngrams generated or the length of input to the ngrams is kept within certain boundaries.
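
If you do need a hard bound, one more option worth a look is the limit token filter, which caps the number of tokens a field emits at index time. A minimal sketch of how it could be wired into the description analyzer above (the cap of 5000 is an arbitrary placeholder):

PUT ngram_capped_test
{
  "settings": {
    "index.max_ngram_diff": 10,
    "analysis": {
      "filter": {
        "description_token_cap": {
          "type": "limit",
          "max_token_count": 5000 # Arbitrary example cap; grams beyond this are simply not indexed
        }
      },
      "tokenizer": {
        "description_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "description_analyzer": {
          "tokenizer": "description_ngram_tokenizer",
          "filter": ["lowercase", "asciifolding", "description_token_cap"]
        }
      }
    }
  }
}

The trade-off is that anything past the cap in a very long description becomes unsearchable, so it only makes sense if truncating the tail of the field is acceptable.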

I hope this helps!

Thank you, Adam. I'll incorporate your suggestions and develop a functional solution.
