How to improve query speed

Greetings,

Any advice on how to improve query speed for a specific use case would be greatly appreciated!

We currently have a cluster of 2 nodes (1 master and data, 1 data only), both running on AWS instances with the data directory on the EBS volume type recommended for this kind of workload. Each machine has 8 vCPUs and 61 GB of RAM, 30 GB of which is allocated to the ES heap. The number of shards is currently set to 30 with 1 replica, since we expect to grow to at least 30 GB per shard in the upcoming months.

Swapping has been reduced to close to none on both servers.

We currently have an index with approximately 500 GB of data across 50 million documents.

The size of the index is due to the following:

  • We're using ngrams to be able to find substrings inside bigger strings and avoid double-wildcard queries, which take longer to process. Ngrams are applied to large fields (more than 100,000 characters).
  • We're using keyword sub-fields on the same large fields to remove duplicates via aggregations, but we skip values that have over 8,000 characters.
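For reference, a simplified sketch of what a mapping for such a field could look like (the postContent field name comes from our query below; the raw sub-field name and the use of ignore_above to skip the long values are illustrative):

```json
PUT /test_v10
{
  "mappings": {
    "test": {
      "properties": {
        "postContent": {
          "type": "text",
          "analyzer": "whitespace_analyzer",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "whitespace_ngram_analyzer"
            },
            "raw": {
              "type": "keyword",
              "ignore_above": 8000
            }
          }
        }
      }
    }
  }
}
```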

The use case involves searching for text with special characters and also finding substrings inside bigger strings (e.g. finding me$s in some$space) with highlighting. We're using the following analyzers:

"analysis": {
      "filter": {
        "ngram_filter": {
          "type": "nGram",
          "min_gram": 5,
          "max_gram": 8
        }
      },
      "analyzer": {
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "whitespace_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "ngram_filter"
          ]
        }
      }
    }

The _all field has been disabled and we only search the fields we require.

Given this, we currently retrieve results with an average response time of around 3-4 seconds.

Is there anything else we could do to improve search speed, apart from adding more nodes to the cluster?

Any feedback is greatly appreciated!

Thanks!

Why is min_gram 5 and max_gram 8? I guess 5 is the minimum substring length that you allow users to search for. I would use a max_gram of 5 as well; this should make the index smaller and give the query fewer terms.
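Concretely, something like this for the filter, i.e. a fixed gram length (a sketch of the settings change only; the index would need to be rebuilt for it to take effect):

```json
"filter": {
  "ngram_filter": {
    "type": "nGram",
    "min_gram": 5,
    "max_gram": 5
  }
}
```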

There are certain cases where an ngram of 5 might not produce proper results, e.g. searching for www.some would become www.s AND ww.so AND w.som AND .some, which can match grams that don't actually occur next to each other (off the top of my head).

We've set it to 5-8 to cover more ground when trying to find specific documents, while also reducing the number of ngrams a user might have to input to get them (for longer words for example).

You could work around that by using phrase queries to make sure www.s, ww.so and w.som occur at consecutive positions. You would have to use the ngram tokenizer for it to work, however (the ngram token filter puts all grams at the same position).
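A sketch of what that could look like (the tokenizer, analyzer and index names here are illustrative; token_chars is set so that grams don't span whitespace, mimicking your current whitespace tokenizer):

```json
PUT /test_v11
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 5,
          "max_gram": 5,
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      },
      "analyzer": {
        "whitespace_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}

GET /test_v11/test/_search
{
  "query": {
    "match_phrase": {
      "postContent.ngram": "www.some"
    }
  }
}
```

The match_phrase query analyzes "www.some" into its grams and only matches documents where those grams appear at consecutive positions.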

Would that provide a significant performance improvement?

We're currently using the Query String Query because it covers most of our use cases and allows us to properly search for what we need to find. I was unable to find much about phrase queries apart from the fact that they allow searching for a phrase.

Thanks!

Can you share what a query looks like and the output of the validate query API with rewrite=true?

GET /test_v10/test/_validate/query?rewrite=true

{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "addDate": {
            "from": "1440426457000",
            "include_lower": true,
            "include_upper": true,
            "to": "1535120857000"
          }
        }
      },
      "must": {
        "query_string": {
          "query": "dolphin AND delf*",
          "fields": ["postContent", "postContent.ngram", "postTitle", "postTitle.ngram"]
        }
      }
    }
  }
}

Response:

{
   "valid": true,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "explanations": [
      {
         "index": "test_v10",
         "valid": true,
         "explanation": "+(+(postContent:dolphin | postContent.ngram:dolphin | postTitle:dolphin | postTitle.ngram:dolphin) +(postContent:delf* | postContent.ngram:delf* | postTitle:delf* | postTitle.ngram:delf*)) #addDate:[-9223372036854775808 TO 9223372036854775807]"
      }
   ]
}

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.