Guidance on partial keyword search

Hi,

What is the most effective approach for implementing partial keyword search over lengthy indexed texts (exceeding 5000 characters)? For instance, if a document contains the keyword "Exclusive", searching with "clus" should yield a match.

The ngram tokenizer (and, for prefix-only matching, edge_ngram) can meet this requirement, but setting min_gram to 1 and max_gram to 20 puts significant strain on the server when the text exceeds 5000 characters. Even tightening min_gram and max_gram to, say, 2 and 10 still produces an abundance of tokens because of the sheer character count, which is a challenge in terms of server load and efficiency.
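
To illustrate the blow-up, here is a minimal sketch using the _analyze API (the gram sizes match the "optimized" 2/10 values above, and the sample text is just the example keyword):

POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 10
  },
  "text": "Exclusive"
}

This single 9-character word already yields 36 grams (including "clus"); with a 5000+ character field the same settings produce tens of thousands of tokens per document.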

Thanks

Hi Sandip and welcome to the community,

You're right that using n-grams has a cost, as it explodes the number of tokens being generated. There are some ways to mitigate this:

  • Narrow the gram range as far as the requirements allow; min_gram 1 and max_gram 20 is far more than most use cases need.
  • Avoid falling back to wildcard queries (e.g. *clus*): they skip the indexing cost, but they are not recommended for partial search since they scan terms at query time and get slow on large indices.
  • Consider the search_as_you_type field type instead of a hand-rolled ngram setup (see the sketch below).
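
For the last point, here is a minimal sketch of what that could look like (the index and field names are hypothetical):

PUT sayt_test
{
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type"
      }
    }
  }
}

POST sayt_test/_search
{
  "query": {
    "multi_match": {
      "query": "exclu",
      "type": "bool_prefix",
      "fields": ["title", "title._2gram", "title._3gram"]
    }
  }
}

Keep in mind that search_as_you_type is built on shingles and edge n-grams, so it targets prefix matching ("exclu" would match "Exclusive") rather than matching in the middle of a word ("clus" would not).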

However, I'd like to take a step back and ask some more details about your use case. What are the requirements for the partial search (on what type of field, what information does the field contain, are you powering a search bar or highlighting etc)? Do you want to match even on 1 character (e.g. any document that contains "a" in that field)? And also match a few characters anywhere within a 5000+ character text?

Hi Adam,

Thank you for your prompt response.

Here are my follow-up comments based on your feedback:

  • I acknowledge the need to reevaluate the current ngram configuration (min_gram/max_gram).
  • As you rightly pointed out, the wildcard query is not recommended for partial search, and we are aligned on avoiding it.
  • I'll explore your suggestion of the search-as-you-type feature as a possible alternative.

Regarding the use case:

The index contains approximately 800K documents. The existing search-as-you-type functionality is implemented with ngram: 5 fields are consolidated into a single combined field for searching, and the ngram analyzer is applied to that combined field. The 'description' field is the tricky one, since it usually contains a substantial amount of text, whereas the other fields together generally stay under 400 characters. My objective is to keep the search-as-you-type functionality covering all of these fields. All of them contain letters, numbers, spaces, and sometimes special characters. Highlighting is not currently in use. The primary goal is simply to show the relevant documents in the search results when the user's keywords partially match.
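
To illustrate, the current setup is conceptually along these lines (simplified: only two of the five source fields are shown, the names are made up, and copy_to stands in for how the fields are actually consolidated):

PUT combined_ngram_test
{
  "settings": {
    "index.max_ngram_diff": 19, # min_gram and max_gram differ by 19, above the default limit of 1
    "analysis": {
      "tokenizer": {
        "combined_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "combined_ngram_analyzer": {
          "tokenizer": "combined_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title":       { "type": "text", "copy_to": "combined" },
      "description": { "type": "text", "copy_to": "combined" },
      "combined": {
        "type": "text",
        "analyzer": "combined_ngram_analyzer" # The ngram analyzer only runs on the consolidated field
      }
    }
  }
}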

I hope this gives a clear picture of the scenario.

Thanks for providing some more details. I'm afraid I can't think of any better solution off the top of my head than reevaluating the requirements and fine-tuning your analyzer and index mapping accordingly.

For example, could you loosen the requirement for matching 1-2 characters or matching >10 chars? If the field is full text (and so it contains words separated by spaces and punctuation), I find it unlikely that someone would want to type in 10+ chars and still continue typing to narrow down search results.

Alternatively, you can define a mapping where description is analyzed into grams of 3-10 chars and the other 4 fields into grams of 1-10 chars, then combine the search with a bool query, something like the example below. This supports hybrid matching while limiting memory usage. Note that these tokenizers only build grams from letters and digits (via token_chars), so spaces, punctuation, and other special characters are stripped and the text is broken up word by word.

PUT ngram_test
{
  "settings": {
    "index.max_ngram_diff": 10, # Allow large difference between min/max_ngram
    "analysis": {
      "tokenizer": {
        "title_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 10
        },
        "description_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "title_analyzer": {
          "filter" : ["lowercase", "asciifolding"],
          "tokenizer": "title_ngram_tokenizer"
        },
        "description_analyzer": {
          "filter" : ["lowercase", "asciifolding"],
          "tokenizer": "description_ngram_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "title_analyzer"
          }
        }
      },
      "description": {
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "description_analyzer"
          }
        }
      },
      "status": {
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "title_analyzer"
          }
        }
      }
    }
  }
}

Query:

POST ngram_test/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title.ngram": {
              "query": "abc",
              "operator": "AND" # All ngrams of the query should match ngrams of the field
            }
          }
        },
        {
          "match": {
            "description.ngram": {
              "query": "abc",
              "operator": "AND"
            }
          }
        },
        {
          "match": {
            "status.ngram": {
              "query": "abc",
              "operator": "AND"
            }
          }
        }
      ], # Finds 3-10 char fragments in all fields; finds 1-2 char fragments in fields other than description
      "minimum_should_match": 1
    }
  }
}

In any case I think the key to improving performance is ensuring that either the number of ngrams generated or the length of input to the ngrams is kept within certain boundaries.
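
If you do need a hard bound, one more option worth a look is the limit token filter, which caps the number of tokens a field emits at index time. A minimal sketch of how it could be wired into the description analyzer above (the cap of 5000 is an arbitrary placeholder):

PUT ngram_capped_test
{
  "settings": {
    "index.max_ngram_diff": 10,
    "analysis": {
      "filter": {
        "description_token_cap": {
          "type": "limit",
          "max_token_count": 5000 # Arbitrary example cap; grams beyond this are simply not indexed
        }
      },
      "tokenizer": {
        "description_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "description_analyzer": {
          "tokenizer": "description_ngram_tokenizer",
          "filter": ["lowercase", "asciifolding", "description_token_cap"]
        }
      }
    }
  }
}

The trade-off is that anything past the cap in a very long description becomes unsearchable, so it only makes sense if truncating the tail of the field is acceptable.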

I hope this helps!

Thank you, Adam. I'll incorporate your suggestions and develop a functional solution.
