Slow querying of Elasticsearch logs

We have an index with 90 primaries and 0 replicas, and a new index is created each day. The volume grows to 7-8TB per day. Our lifecycle policy moves each index from hot to warm nodes after 3 days, and it then stays on the warm nodes for a month. We use Kibana index patterns to visualize the logs for a given time period. Queries are generally fast while the logs are on hot nodes, but once the logs have migrated to warm nodes, even the smallest time window takes forever to load (the queries mostly time out).

Summary:
Index-dd-mm-yyyy (8TB per day)

  • Primaries=90
  • Replicas=0

Lifecycle policy:
Hot (3days) -> Warm (30 days)

ELK version: 7.8.0

Issue:
When filtering logs in Kibana using the index pattern time filter, logs older than 3 days take forever to load. Even a 15-minute window of logs does not load.

What I researched and found out:
The following things should be done when the index moves to the warm phase

  • Make the indices read-only
  • Force merge the indices (Not sure if this could be done because the size of the index is too large and ES warns when we attempt to do so)
  • Roll over the indices when they reach a size threshold (let's say 1 TB), or increase the number of primaries to reduce the shard sizes (best practice is to keep shard sizes in the 10-50 GB range)
  • Create replica shards (This would take an additional 8TB space and a lot of memory but will help optimize the searching speed)
  • Freeze other unused indices to free up memory resources
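
For reference, the per-index steps in the list above map roughly onto the following Elasticsearch REST calls. This is only a sketch: it builds the request descriptions and sends nothing, the helper names and the index name `logs-example` are hypothetical, and `_freeze` is the 7.x API (it was deprecated in later releases).

```python
import json

def warm_phase_requests(index, replicas=1, max_num_segments=1):
    """Return (method, path, body) tuples for the warm-phase steps.

    Nothing is sent; review the output and issue the calls yourself,
    for example from Kibana Dev Tools. `index` is a hypothetical name.
    """
    return [
        # Block writes so the index is effectively read-only.
        ("PUT", f"/{index}/_settings", {"index.blocks.write": True}),
        # Merge segments; this is I/O heavy on very large shards.
        ("POST", f"/{index}/_forcemerge?max_num_segments={max_num_segments}", None),
        # Add replicas: costs extra disk but adds search capacity.
        ("PUT", f"/{index}/_settings", {"index.number_of_replicas": replicas}),
    ]

def freeze_request(index):
    """Request that freezes an index that is rarely searched (7.x API)."""
    return ("POST", f"/{index}/_freeze", None)

if __name__ == "__main__":
    for method, path, body in warm_phase_requests("logs-example"):
        print(method, path, json.dumps(body) if body is not None else "")
    print(*freeze_request("logs-old-example")[:2])
```

Keeping the read-only step before the force merge matters, because merging an index that is still being written to just creates new segments again.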

I am cautious while making any changes. Can you please suggest if this is the right approach or am I missing something? Thanks in advance

How many hot and warm nodes do you have? What is the specification of these nodes? What type of storage are you using?

Indexing into 90 shards can be inefficient and using daily indices also means each shard likely covers the full day.
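
To put rough numbers on that, here is a back-of-the-envelope calculation. It assumes the 8TB per day from the question is spread evenly across the 90 primaries and uses decimal units (1 TB = 1000 GB); the function names are just for illustration.

```python
import math

def shard_size_gb(daily_volume_tb, primaries):
    """Average primary shard size in GB, assuming an even spread."""
    return daily_volume_tb * 1000 / primaries

def shards_needed(daily_volume_tb, target_shard_gb):
    """Primary shards per day needed to stay at or below the target size."""
    return math.ceil(daily_volume_tb * 1000 / target_shard_gb)

if __name__ == "__main__":
    # Current layout: one daily index with 90 primaries.
    print(round(shard_size_gb(8, 90), 1))  # 88.9
    # Shards per day needed to stay at or below 50 GB each.
    print(shards_needed(8, 50))            # 160
```

So each shard ends up at roughly 89 GB, well above the 10-50 GB range mentioned in the question, and every one of those shards holds data for the whole day.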

If you have data arriving in near real time, I would recommend switching to rollover. This will allow you to have 1 or 2 primary shards per hot node and set a size threshold of 50GB. You will likely get more indices per day, but each index will cover a smaller time period. When these move to the warm tier and you query a 15-minute interval, it is likely that a much smaller set of shards, and therefore nodes, will need to be searched, which should speed things up.
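
A minimal sketch of what such a policy body could look like, built as a Python dict so the size threshold can be derived from the shard count. Several things here are assumptions rather than anything from this thread: the helper name, the example primary count, the `max_age` safety cap, and my reading of the 50GB threshold as a per-shard target. On 7.8 the rollover `max_size` condition applies to the total size of all primaries in the index (a per-shard condition only arrived in later releases), so the sketch multiplies the target by the number of primaries.

```python
import json

def rollover_policy(primaries, shard_gb=50, warm_after="3d", max_age="1d"):
    """Build an ILM policy body with size-based rollover in the hot phase.

    `max_size` is the total primary size of the index, so a per-shard
    target of `shard_gb` becomes primaries * shard_gb.
    `max_age` is an assumed safety cap for quiet periods.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_size": f"{primaries * shard_gb}gb",
                            "max_age": max_age,
                        }
                    }
                },
                "warm": {
                    # With rollover, min_age is counted from rollover time.
                    "min_age": warm_after,
                    # Keep whatever allocation rules the existing policy
                    # uses to move indices to warm nodes; omitted here.
                    "actions": {"readonly": {}},
                },
            }
        }
    }

if __name__ == "__main__":
    # Hypothetical example: 10 primaries per index -> roll over at 500gb.
    print(json.dumps(rollover_policy(primaries=10), indent=2))
```

The resulting JSON would be the body of a `PUT _ilm/policy/<policy-name>` request, with the write alias and index template set up for rollover as usual.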

You should be looking at using ILM to manage rollover, it will make it a lot easier to manage and your shard size will be a bit more reasonable.

Version 7.8.0 is EOL, please upgrade as a matter of urgency.

Thanks for your prompt response. We will definitely go with the rollover. Can you also comment on the other options I listed in the query above?

Here are the details that you asked for:
The cluster is hosted on AWS (EC2 instances).

  • 9 masters

  • 12 Kibana

  • 114 Hot

    • Instance type: r5.xlarge
    • Volume type: gp2
    • Volume: 1TB
    • HeapSize: 14g
  • 66 Warm

    • Instance type: r5.xlarge
    • Volume type: st1
    • Volume: 3TB
    • HeapSize: 14g

Thanks for your suggestion. We are already using ILM policies and it would be easy to manage rollover as well. Also, we have planned to upgrade the cluster soon. So it won't be an issue.

9 master nodes is excessive. You should always aim to have 3 dedicated master nodes.

I would recommend upgrading. This version is very old and a lot of improvements have been added in newer versions.

Running a lot of small nodes is IMHO not ideal. I always recommend first scaling up to 64GB RAM and 30GB heap before starting to scale out. I would therefore recommend you reduce the number of hot nodes and switch to r5.2xlarge instances with double the storage.

I see that you are using 1TB gp2 EBS volumes. These are not very fast (max 3000 IOPS if I recall correctly) and I would recommend using faster storage for the hot nodes. That may allow you to further reduce the number of hot nodes.

For the warm nodes I would also recommend switching to larger r5.2xlarge instances with more storage per node, in order to reduce the number of nodes in the cluster.

I see that you are using st1 EBS here. This tends to be very slow and IMHO not suitable for Elasticsearch warm nodes. I suspect you will find that the query performance of the warm nodes is limited by disk performance. Run iostat -x on these nodes to see what await and disk utilisation look like. st1 EBS is optimised for large sequential reads and writes if I recall correctly, and this is not what an Elasticsearch load generally looks like. I would recommend switching to gp2/gp3 EBS for the warm nodes.
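
To illustrate the gap, here is a rough comparison of the two volume types. The per-volume limits in the code are assumptions recalled from the AWS EBS documentation (gp2: 3 IOPS per GiB with a floor of 100 and a cap of 16,000; st1: 40 MiB/s baseline throughput per TiB, capped at 500 MiB/s, and at most 500 IOPS counted in 1 MiB I/Os), so please check them against the current AWS pages before relying on them.

```python
# All limits below are assumed figures, recalled from AWS EBS docs.
ST1_MAX_IOPS = 500  # per volume, counted in 1 MiB I/Os

def gp2_baseline_iops(size_gib):
    """gp2 baseline IOPS: 3 per GiB, floor 100, cap 16,000."""
    return max(100, min(16000, 3 * size_gib))

def st1_baseline_mibps(size_tib):
    """st1 baseline throughput in MiB/s: 40 per TiB, cap 500."""
    return min(500, 40 * size_tib)

if __name__ == "__main__":
    print(gp2_baseline_iops(1000))  # current hot volume: 3000
    print(gp2_baseline_iops(3000))  # a 3TB gp2 warm volume: 9000
    print(st1_baseline_mibps(3))    # current 3TB st1 warm volume: 120
    print(ST1_MAX_IOPS)             # st1 IOPS ceiling: 500
```

Under those assumptions a 3TB gp2 volume offers many times the IOPS of the current st1 volumes, which matters because searches mostly issue small random reads rather than large sequential ones.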