Query Optimization

We are running Elasticsearch 7.8.0.
We have monthly indices for each client, e.g. client_data_{clientId}_2020-09.
Each index has 1 primary shard and 1 replica.

One of the queries is taking too much time, and I'm trying to understand why and how to optimize it.
Here is the query:

GET client_data_XXX_2020-09/_search?ignore_unavailable=true
{
  "track_total_hits": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "date_time": {
              "gte": "2020-09-01T00:00:00",
              "lt": "2020-09-30T00:00:00"
            }
          }
        },
        {
          "term": {
            "domain": "www.example.com"
          }
        }
      ]
    }
  },
  "aggs": {
    "date": {
      "date_histogram": {
        "field": "date_time",
        "fixed_interval": "1d",
        "min_doc_count": 1
      },
      "aggs": {
        "plt": {
          "avg": {
            "field": "plt"
          }
        }
      }
    }
  }
}

This is the basic version of the query, and it can take up to 10s; the more complex version of the same query can take up to 40s, on an index with around 170 million documents.

I thought about pre-aggregating the data daily, but we have some more complex versions of this query, for example with two more sub-aggregations after the date_histogram, so we would end up with many aggregations and higher memory usage... not the perfect idea.
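For reference, the daily pre-aggregation I had in mind would look something like a rollup job; the job name, rollup index, cron schedule, and page size below are just placeholders:

```
PUT _rollup/job/client_data_daily
{
  "index_pattern": "client_data_*",
  "rollup_index": "client_data_rollup",
  "cron": "0 0 2 * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "date_time",
      "fixed_interval": "1d"
    },
    "terms": {
      "fields": [ "domain" ]
    }
  },
  "metrics": [
    {
      "field": "plt",
      "metrics": [ "avg" ]
    }
  ]
}
```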

Any other ideas ? :slight_smile:

Hey,

there are quite a few different factors at play here. First, you can remove the track_total_hits search parameter, as aggregations always need to track the total hits anyway.

Since you are running this query against an index with one primary shard and one replica, only a single shard gets queried, so the query will only ever be as fast as a single node. You could try using more than one primary shard to parallelize the aggregation across several nodes, depending on the size of your cluster.
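Since the shard count of an existing index cannot be changed in place, this would apply to newly created monthly indices (the index name and shard count below are purely illustrative):

```
PUT client_data_XXX_2020-10
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```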

Also, is that search only slow the first time, because the data needs to be loaded into the filesystem cache, or is it still slow if you call it several times in a row (note that requests will probably alternate between the primary and the replica shard)?
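To compare repeated runs without the shard request cache skewing the took values, you can disable that cache per request, using the same request body as above:

```
GET client_data_XXX_2020-09/_search?request_cache=false&ignore_unavailable=true
```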

Do you notice garbage collections happening when you aggregate on your data? That would indicate you do not have enough heap. Also, how much memory is left for the filesystem cache, and how big is that index on disk?
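Heap usage and GC counts can be checked via the nodes stats API, and the on-disk index size via the cat indices API, for example:

```
GET _nodes/stats/jvm

GET _cat/indices/client_data_XXX_2020-09?v&h=index,docs.count,store.size
```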

How many queries are running in parallel that might result in this one waiting in the search queue?
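The search thread pool queue (and any rejections) can be inspected with the cat thread pool API:

```
GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected
```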

Using the profile API might also help to uncover where things are slow.
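Profiling is enabled by adding "profile": true to the request body; a minimal sketch based on your query:

```
GET client_data_XXX_2020-09/_search
{
  "profile": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "domain": "www.example.com" } }
      ]
    }
  }
}
```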

Hope this helps as a start.