Elasticsearch query optimization

We have 780 GB of data in an Elasticsearch index. We are running the query below:
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "query_string": {
                "query": "(coreStringCaseIns:\"Auto enhance photos with just a tap.\" OR locStringCaseIns:\"Auto enhance photos with just a tap.\")",
                "fields": [], "type": "best_fields", "default_operator": "or",
                "max_determinized_states": 10000, "enable_position_increments": true,
                "fuzziness": "AUTO", "fuzzy_prefix_length": 0, "fuzzy_max_expansions": 50,
                "phrase_slop": 0, "escape": false,
                "auto_generate_synonyms_phrase_query": true, "fuzzy_transpositions": true,
                "boost": 1
              }
            }
          ],
          "filter": [
            {
              "bool": {
                "should": [
                  { "term": { "field1": { "value": "45001", "boost": 1 } } },
                  { "term": { "field2": { "value": "45002", "boost": 1 } } },
                  { "term": { "field": { "value": "45003", "boost": 1 } } }
                ],
                "adjust_pure_negative": true,
                "boost": 1
              }
            },
            { "bool": { "adjust_pure_negative": true, "boost": 1 } },
            { "bool": { "adjust_pure_negative": true, "boost": 1 } },
            { "bool": { "adjust_pure_negative": true, "boost": 1 } },
            { "bool": { "adjust_pure_negative": true, "boost": 1 } }
          ],
          "adjust_pure_negative": true,
          "boost": 1
        }
      },
      "functions": [
        {
          "filter": {
            "bool": {
              "must": [
                {
                  "query_string": {
                    "query": "(id:\"10426\")",
                    "fields": [], "type": "best_fields", "default_operator": "or",
                    "max_determinized_states": 10000, "enable_position_increments": true,
                    "fuzziness": "AUTO", "fuzzy_prefix_length": 0, "fuzzy_max_expansions": 50,
                    "phrase_slop": 0, "escape": false,
                    "auto_generate_synonyms_phrase_query": true, "fuzzy_transpositions": true,
                    "boost": 1
                  }
                }
              ],
              "adjust_pure_negative": true,
              "boost": 1
            }
          },
          "weight": 297630966000000
        },
        {
          "filter": {
            "bool": {
              "must": [
                {
                  "query_string": {
                    "query": "(id:\"10110\")",
                    "fields": [], "type": "best_fields", "default_operator": "or",
                    "max_determinized_states": 10000, "enable_position_increments": true,
                    "fuzziness": "AUTO", "fuzzy_prefix_length": 0, "fuzzy_max_expansions": 50,
                    "phrase_slop": 0, "escape": false,
                    "auto_generate_synonyms_phrase_query": true, "fuzzy_transpositions": true,
                    "boost": 1
                  }
                }
              ],
              "adjust_pure_negative": true,
              "boost": 1
            }
          },
          "weight": 33242801500000
        },
        {
          "filter": {
            "bool": {
              "must": [
                {
                  "query_string": {
                    "query": "(id:\"522\")",
                    "fields": [], "type": "best_fields", "default_operator": "or",
                    "max_determinized_states": 10000, "enable_position_increments": true,
                    "fuzziness": "AUTO", "fuzzy_prefix_length": 0, "fuzzy_max_expansions": 50,
                    "phrase_slop": 0, "escape": false,
                    "auto_generate_synonyms_phrase_query": true, "fuzzy_transpositions": true,
                    "boost": 1
                  }
                }
              ],
              "adjust_pure_negative": true,
              "boost": 1
            }
          },
          "weight": 1385116730000
        },
        {
          "filter": {
            "bool": {
              "must": [
                {
                  "query_string": {
                    "query": "(locale:\"ja_JP\")",
                    "fields": [], "type": "best_fields", "default_operator": "or",
                    "max_determinized_states": 10000, "enable_position_increments": true,
                    "fuzziness": "AUTO", "fuzzy_prefix_length": 0, "fuzzy_max_expansions": 50,
                    "phrase_slop": 0, "escape": false,
                    "auto_generate_synonyms_phrase_query": true, "fuzzy_transpositions": true,
                    "boost": 1
                  }
                }
              ],
              "adjust_pure_negative": true,
              "boost": 1
            }
          },
          "weight": 9999999800000
        },
        {
          "filter": {
            "bool": {
              "must": [
                {
                  "range": {
                    "modify": {
                      "from": "now-30d", "to": null,
                      "include_lower": true, "include_upper": true,
                      "boost": 1
                    }
                  }
                }
              ],
              "adjust_pure_negative": true,
              "boost": 1
            }
          },
          "weight": 9999999800000
        }
      ],
      "score_mode": "multiply",
      "max_boost": 3.4028235e+38,
      "boost": 1
    }
  }
}

Elasticsearch takes 15.867 s to return the results. We have already tried most of the optimizations we know of; I am posting this question to find out whether any further optimization is possible.

How many shards does your index have?

Hi theDor,

There are 5 primary shards, each with 1 replica.

Here is the system configuration:

Total size - 1.55 TB
Used - 793 GB

There are 4 nodes holding the 5 primary shards, and each primary shard has 1 replica shard.

Hope this info helps. Right now performance is very bad: as mentioned, Elasticsearch takes about 15 seconds to return results for the query above.

I am eager to find any scope for optimizing the query, or any other workaround that improves performance.

Please let me know if you need any more info.

Thanks!

I would suggest a number of solutions:

  1. Increase your cluster's resources (more memory).
  2. Increase the number of shards in the index. The best shard size is between 20 GB and 40 GB; with more shards, your query will be more distributed across your Elasticsearch cluster.
  3. Split the index data into several indices (time-based, for example) and query only the ones you need. You then search less than the full 780 GB and should get faster results (of course, this depends on your needs).
  4. Use the Search Profiler in Kibana Dev Tools, so you can check after each change whether the results improve.

Hope this helps.

Hi theDor,

Thank you for the response.

I would like to clarify a few things.

  1. My cluster size is 1.55 GB with 4 data nodes. By increasing the cluster size, I assume you also mean adding nodes. Can you suggest the optimum cluster size and number of nodes for 750 GB of data?

  2. If I keep the number of nodes and the cluster size unchanged, will increasing the number of primary shards, so that each shard holds 20 GB to 40 GB of data, reduce the search time significantly?

I would suggest increasing the index's number of shards to between 20 and 35 (depending on how many data nodes you can add).
In my experience, using more and smaller shards makes queries run faster, because every shard is its own inverted index and the query becomes more distributed across the cluster.
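
The arithmetic behind that range can be sketched as follows. This is only an illustration: `primary_shard_range` is a hypothetical helper name, the 780 GB figure comes from the original question, and the 20 GB to 40 GB target is the rule of thumb quoted earlier in the thread, not a hard limit.

```python
import math


def primary_shard_range(index_gb, min_shard_gb=20, max_shard_gb=40):
    """Return (fewest, most) primary shards that keep each shard
    between min_shard_gb and max_shard_gb for an index of index_gb."""
    fewest = math.ceil(index_gb / max_shard_gb)   # larger shards -> fewer of them
    most = math.floor(index_gb / min_shard_gb)    # smaller shards -> more of them
    return fewest, most


# 780 GB of data, as stated in the question.
print(primary_shard_range(780))  # (20, 39)

# Current layout: 5 primaries, so each shard holds roughly this many GB.
print(780 / 5)  # 156.0
```

With 5 primaries each shard is about 156 GB, well above the suggested range, which is consistent with the advice to move to roughly 20 or more primaries.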


Just curious... have you tried the query without that range filter?

Hi itizir,

I tried running the query without the range filter, but there is still no improvement.

Hey again.

Hm, I see. I was just suggesting because we've seen poor performance of range queries, and hadn't looked at the query closely.

I'm not particularly familiar with function_score and full text search, but can still try to help...

  • Why all the empty bool clauses in the filter? (probably not affecting things though)
  • Except for the time range, all queries in the functions look like they should be simple term queries matching project IDs, etc. (or am I misunderstanding?). So I'm not sure why they are relegated to functions.
  • The first main query (looking for "Auto enhance photos with just a tap."): is that looking for an exact match? What's the mapping in these coreStringCaseIns and locStringCaseIns fields? If this is the expensive query, perhaps it should be moved?
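
One way to read the points above is the following sketch of a slimmed-down request body, written as a Python dict for readability. It is not a tested replacement for the original query. It assumes `id` and `locale` are keyword-style fields for which an exact `term` match is what was intended (the mapping is not shown in the thread), it drops the empty `bool` filter clauses, and it omits the long list of `query_string` parameters, so re-add any you rely on. `term_weight` and `build_query` are hypothetical helper names.

```python
import json

PHRASE = "Auto enhance photos with just a tap."


def term_weight(field, value, weight):
    """One function_score entry: an exact term filter and its weight."""
    return {"filter": {"term": {field: value}}, "weight": weight}


def build_query(phrase):
    """Build the simplified body. `phrase` must not contain double quotes,
    because it is embedded in a query_string expression without escaping."""
    text = f'(coreStringCaseIns:"{phrase}" OR locStringCaseIns:"{phrase}")'
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "must": [{"query_string": {"query": text}}],
                        # Only the non-empty filter clause is kept.
                        "filter": [
                            {"bool": {"should": [
                                {"term": {"field1": "45001"}},
                                {"term": {"field2": "45002"}},
                                {"term": {"field": "45003"}},
                            ]}}
                        ],
                    }
                },
                "functions": [
                    # Weights copied from the original query.
                    term_weight("id", "10426", 297630966000000),
                    term_weight("id", "10110", 33242801500000),
                    term_weight("id", "522", 1385116730000),
                    term_weight("locale", "ja_JP", 9999999800000),
                    {"filter": {"range": {"modify": {"gte": "now-30d"}}},
                     "weight": 9999999800000},
                ],
                "score_mode": "multiply",
            }
        }
    }


print(json.dumps(build_query(PHRASE), indent=2))
```

Whether this is actually faster would need to be measured against the real index; if `id` or `locale` are analyzed text fields, the `term` filters may not match the same documents as the original `query_string` clauses.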

Have you tried profiling the query, to get a sense of what takes time?
Does the query cache well, as in does it get faster quickly as you repeat the search?
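
For the profiling question, a minimal sketch: setting `"profile": true` in the search body asks Elasticsearch to return per-shard timing for each query component. The helpers below (`with_profile` and `slowest_components` are hypothetical names) switch profiling on and flatten the returned profile tree so the slowest components come first. How the request is sent (client library, curl, Kibana) is left out on purpose.

```python
def with_profile(body):
    """Return a copy of a search body with profiling switched on."""
    return {**body, "profile": True}


def slowest_components(response, top=5):
    """Flatten the profile tree of a search response and return the
    `top` entries as (time_in_nanos, shard_id, type, description).
    A parent's time includes the time of its children."""
    rows = []

    def walk(node, shard_id):
        rows.append((node["time_in_nanos"], shard_id,
                     node["type"], node["description"]))
        for child in node.get("children", []):
            walk(child, shard_id)

    for shard in response["profile"]["shards"]:
        for search in shard["searches"]:
            for node in search["query"]:
                walk(node, shard["id"])

    return sorted(rows, reverse=True)[:top]
```

Running the profiled query a few times in a row and comparing the `took` value of each response would also answer the caching question.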

Just a small correction:

It's not 1.55 GB but 1.55 TB :)

Hi theDor,

Thank you so much, that is quite crisp and clear!
Can you also suggest how many shards a node should ideally contain?

I would suggest reading this blog post about shards:


As for your question, the suggested limit is about 20 shards per 1 GB of heap.
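
As a rough illustration of that rule of thumb, the sketch below compares the shards each node would hold with the ceiling implied by its heap. The thread never states the heap size, so the 8 GB used here is purely a made-up example value, and both function names are hypothetical.

```python
import math


def max_shards_per_node(heap_gb, shards_per_heap_gb=20):
    """Upper bound on shards per node under the 20-per-GB-of-heap guideline."""
    return heap_gb * shards_per_heap_gb


def shards_per_node(primaries, replicas, data_nodes):
    """Shards (primaries plus replica copies) on the busiest node,
    assuming they are spread evenly across the data nodes."""
    total = primaries * (1 + replicas)
    return math.ceil(total / data_nodes)


# Current layout from the thread: 5 primaries, 1 replica each, 4 nodes.
print(shards_per_node(5, 1, 4))    # 3

# Proposed layout with 20 primaries on the same 4 nodes.
print(shards_per_node(20, 1, 4))   # 10

# Ceiling for a node with a hypothetical 8 GB heap (assumed value).
print(max_shards_per_node(8))      # 160
```

Under these assumptions both layouts sit far below the ceiling, so the guideline constrains the upper end of the shard count rather than arguing against the move to 20 or more primaries.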