How to improve query speed

adyjayex · August 23, 2017, 6:29pm

Greetings,

Any advice into how to improve query speed for a specific use case would be greatly appreciated!

We currently have a cluster of 2 nodes (1 master and data, 1 data only), both running on AWS machines with the data folder on the recommended EBS for this kind of task. Each machine has 8 vCPU and 61GB RAM. 30GB of RAM is given to the ES Heap. The number of shards is currently set to 30 with 1 replica since we expect to grow to at least 30GB / shard in the upcoming months

Swapping has been set to be close to none on both servers.

We currently have an index with approx 500GB of data with 50 million documents.

The size of the index is due to the following:

We're using ngrams to be able to find substrings inside bigger strings and avoid using double-wildcard which takes longer to process. Ngrams are applied to big fields (more than 100.000 characters)
We're using keyword sub-fields on the same big fields but we ignore the documents which have over 8000 characters for removing duplicates via aggregations

The use case involves being able to search for text with special characters and also be able to find substrings inside bigger strings (i.e. find me$s in some$space) with highlighting. We're using the following analyzers:

"analysis": {
      "filter": {
        "ngram_filter": {
          "type": "nGram",
          "min_gram": 5,
          "max_gram": 8
        }
      },
      "analyzer": {
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "whitespace_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "ngram_filter"
          ]
        }
      }
    }

The _all field has been disabled and we only search the fields we require.

Given this, we currently retrieve results with an average response time of around 3-4 seconds.

Is there anything else we could do to improve search speed apart from adding more nodes to the cluster ?

Any feedback is greatly appreciated!

Thanks!

jpountz · August 23, 2017, 9:29pm

Why is the min_gram_size 5 and the max_gram size 8? I guess 5 is the minimum substring length that you allow to search for. I would use a max_gram size of 5 as well. This should make the index smaller and the query have fewer terms.

adyjayex · August 23, 2017, 11:23pm

There are certain cases where an ngram of 5 might not produce proper results, i.e. looking for www.some vs www. AND w.so AND some (off the top of my head).

We've set it to 5-8 to cover more ground when trying to find specific documents, while also reducing the number of ngrams a user might have to input to get them (for longer words for example).

jpountz · August 24, 2017, 7:33am

You could work around it by using phrase queries to make sure www.s, ww.so and w.som occur at consecutive positions? You would have to use the ngram tokenizer for it to work however (the ngram token filter puts all grams at the same position).

adyjayex · August 24, 2017, 1:31pm

Would that provide a significant performance improvement ?

We're currently using the Query String Query cause it covers most of our use cases and allows us to properly search for what we need to find, I was unable to find more on Phrase Queries apart from it allowing the search for a phrase.

Thanks!

jpountz · August 24, 2017, 1:48pm

Can you share what a query looks like and the output of the validate query API with rewrite=true?

adyjayex · August 24, 2017, 2:37pm

GET /test_v10/test/_validate/query?rewrite=true

{
    "query": {
       "bool":{
           "filter":{
             "range":{
                "addDate":{
                   "from":"1440426457000",
                   "include_lower":true,
                   "include_upper":true,
                   "to":"1535120857000"
                }
             }
          },
          "must":{
             "query_string":{
                "query":"dolphin AND delf*",
                "fields": ["postContent","postContent.ngram","postTitle","postTitle.ngram"]
             }
          }
       }
    }
}

Response:

{
   "valid": true,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "explanations": [
      {
         "index": "test_v10",
         "valid": true,
         "explanation": "+(+(postContent:dolphin | postContent.ngram:dolphin | postTitle:dolphin | postTitle.ngram:dolphin) +(postContent:delf* | postContent.ngram:delf* | postTitle:delf* | postTitle.ngram:delf*)) #addDate:[-9223372036854775808 TO 9223372036854775807]"
      }
   ]
}

system · September 21, 2017, 2:37pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Storage problem with ngram filters Elasticsearch	8	1232	November 24, 2017
Elastic search Ngram search high latency Elasticsearch	1	431	November 11, 2018
Return very slow when using ngram char split Elasticsearch	4	1131	July 5, 2017
Better effective substring query idea? Elasticsearch	13	1520	July 6, 2017
Tokens outside the ngram size Elasticsearch	2	277	July 6, 2017

How to improve query speed

Related topics