Sharding Strategy when data is in TB

Hi Community

I have a cluster of 13 nodes:

  1. Six hot nodes, each with 32 GB heap, 6 TB of SSD storage, 9 CPUs, and 125 GB of RAM.

  2. Three master nodes, each with 32 GB heap, 3 CPUs, and 50 GB of storage.

  3. Three warm nodes, each with 15 TB of storage, 32 GB heap, and 3 CPUs.

We have two indices, A and B. At every month end, the data size is around 600 GB for index A and 500 GB for index B. Yesterday we recorded 95 crore (950 million) records across both indices, and the data size was 2.8 TB.

We have set 6 primary shards for both indices, but our shard sizes vary between 83 GB and 200 GB.

Searching these indices takes time. I don't know how many primary shards to set, as I don't know how much data I will get. At every month end and in the first week of every month we receive a lot of data, roughly 500 to 800 GB for one index.

Is your data immutable?

What is your retention period for the data?

How are you currently organising your shards?

Which version of Elasticsearch are you using?

How much RAM do the warm nodes have?

A few notes.

32 GB of heap may be too much. It is recommended to stay under the threshold for compressed ordinary object pointers (oops), which normally means staying below roughly 30 GB; going over it may hurt performance. So I would recommend reducing the JVM heap to around 30 GB, as recommended in the documentation.

The same applies to all your other nodes: if the heap exceeds the compressed-oops threshold, performance may suffer.

Also, your master node specs are more than necessary; you could probably reduce the heap to something like 4 GB on each master node.
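
For reference, the heap is normally pinned in a jvm.options.d override. The file name below is just an example, and since the exact compressed-oops cutoff varies per JVM, it is worth confirming in the node's startup logs that compressed oops are actually in use after the change:

```
# config/jvm.options.d/heap.options (example file name)
# Min and max should match; ~30g stays safely below the compressed-oops cutoff.
-Xms30g
-Xmx30g

# On the dedicated master nodes the same pattern would be e.g. -Xms4g / -Xmx4g.
```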

Do you also have replicas?

Since you have hot and warm nodes, are you using ILM to move the indices between the data tiers?

I would say that the best approach is to use ILM with rollover and let Elasticsearch take care of it.

Since you have 6 hot nodes, configure your indices with 6 primary shards and set rollover to create a new backing index when a primary shard reaches 50 GB. You can also configure rollover to create a new index based on age, so that if a primary shard does not reach 50 GB before the maximum age, a new index is created anyway. A rough sketch of that setup is below.
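
Something like this in Dev Tools (the policy name, index pattern, ages, and replica count are placeholders to adapt; ILM's default data-tier routing handles the hot-to-warm move in 8.x):

```
PUT _ilm/policy/monthly-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {}
      }
    }
  }
}

PUT _index_template/index-a-template
{
  "index_patterns": ["index-a-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 6,
      "index.number_of_replicas": 1,
      "index.lifecycle.name": "monthly-logs-policy",
      "index.lifecycle.rollover_alias": "index-a"
    }
  }
}
```

Note that with a rollover alias the first backing index has to be created manually with the alias marked as the write index; using a data stream instead avoids that bootstrap step.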


We are using 8.11.4, and we are using ILM to move the indices from hot to warm. If we enable rollover, we will have a larger number of shards in the cluster.

We have DC/DR (active-active) ELK clusters and we are using Kafka mirroring for syncing. Our DC ELK cluster is working smoothly at an indexing rate of 80,000/s without saturating the CPU. But in DR we have two days of lag and searching is poor; it does not return results in Kibana Discover. Both clusters have the same design.

I cannot figure out why we cannot search while write operations are in progress.

When it comes to Elasticsearch performance, CPU is often not the limiting factor. If you are indexing heavily and also querying, it is often storage performance that becomes the bottleneck.

Do these 2 clusters have exactly the same hardware, including storage? Are they the same size and configuration?

Having a larger number of shards that are close to the optimal size is generally preferable to having fewer shards that are too large, so I would recommend enabling rollover as Leandro said.
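
If you want to see how far your current shards are from that optimal size, the _cat shards API shows the per-shard store size (the index names below are placeholders):

```
GET _cat/shards/index-a,index-b?v&h=index,shard,prirep,store&s=store:desc
```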

The hardware is the same in both clusters. Why is our DC working fine while DR, which is an exact copy of DC, is not working properly? In the coming two months we will add three nodes to both clusters.

If you are feeding both clusters off Kafka with the same configuration and concurrency, and the clusters have the same hardware, there should indeed not be any difference, assuming the load is the same.

I would recommend checking the following:

  • Verify that there is enough indexing data available in Kafka for the second cluster and that the mirroring is not the issue. If there is a backlog in the mirrored Kafka topics, the issue may be with the Elasticsearch cluster. This is, after all, one thing that differs between the two environments.
  • Collect and compare cluster stats from the two clusters. Compare data size, index throughput, query load, query latencies and system metrics like CPU usage and storage I/O.
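
For the comparison, running the same standard APIs on both clusters is a reasonable starting point; on the Kafka side, consumer group lag can be checked with the stock kafka-consumer-groups tool:

```
GET _cluster/stats                   // totals: data size, shard counts, JVM usage
GET _nodes/stats/indices,os,fs       // per node: indexing/search time, CPU, disk I/O
GET _cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected
```

Persistent rejections in the write or search thread pools on the DR cluster would point at the nodes themselves rather than at the mirroring.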