Elasticsearch node goes down

I have a cluster with 5 master nodes, 12 coordinator nodes and 60 data nodes. I am currently doing heavy indexing into this ES cluster, around 15 billion documents spread through the day. We have 3 indices undergoing heavy indexing, with four rollovers per day for each index. Each index has 100 shards and the replica count is set to 1. The nodes run on physical servers with 200 GB of RAM, each node has around 32 GB of heap, and translog durability is set to async.

Bulk indexing via the bulk processor is happening very slowly with 32 clients; the batch size of the bulk processor is 7500, the bulk actions setting is 20, and the bulk size is 25 MB. All the network interfaces have 10 Gb bandwidth.
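
For reference, the bulk processor is set up roughly like the sketch below (the host name is a placeholder, and the exact mapping of the numbers above onto the individual BulkProcessor settings is approximate):

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.bulk.BulkProcessor;
    import org.elasticsearch.action.bulk.BulkRequest;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.common.unit.ByteSizeUnit;
    import org.elasticsearch.common.unit.ByteSizeValue;

    public class BulkProcessorSketch {

        static BulkProcessor build(RestHighLevelClient client) {
            BulkProcessor.Listener listener = new BulkProcessor.Listener() {
                @Override
                public void beforeBulk(long executionId, BulkRequest request) {
                    // called before each bulk request is sent
                }

                @Override
                public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                    // check response.hasFailures() and log rejected items here
                }

                @Override
                public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                    // transport-level failures (e.g. node disconnected) end up here
                }
            };

            return BulkProcessor.builder(
                    (request, bulkListener) -> client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                    listener)
                .setBulkActions(7500)                                // flush after this many actions
                .setBulkSize(new ByteSizeValue(25, ByteSizeUnit.MB)) // or after this much payload
                .setConcurrentRequests(20)                           // in-flight bulks; add() blocks when all are busy
                .build();
        }

        public static void main(String[] args) {
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("es-coordinator-1", 9200, "http"))); // placeholder host
            BulkProcessor processor = build(client);
            // processor.add(new IndexRequest("my-index").source(jsonMap)); // documents are added here
        }
    }

When all concurrent requests are in flight, add() blocks the calling thread, which would be consistent with the "threads blocked in add" symptom listed below.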

  • But the bulk indexing is happening very slowly.

  • During indexing and searching, the following errors are occurring:

    • We have seen that many of the threads pushing data are getting blocked in the bulk processor's add() call.
    • The cluster is not responding properly, and the bulk indexing through the bulk processor is also showing up in the slow log.
    • I am getting node disconnected and ReceiveTimeoutTransportException errors when I do bulk indexing or execute other queries.
  • In the logs I am also seeing "failed to execute query phase (No search context found for id)", GC errors, and nodes being removed and added.

Please suggest. This is a critical issue we are facing.


Why are you indexing into so many shards? Is that 100 primary or 100 primary and replica shards? What is the average shard size?

I have 100 primary shards and the replica count is set to 1. The average shard size is around 30 to 40 GB.

If you have 100 primary shards per index and the average shard size is 30GB you are generating 72TB (30GB * 100 primary shards * 2 (1 replica) * 3 indices * 4 rollovers) of data on disk per day. That is 2400 shards generated per day. To me this sounds a bit strange. Are you sure those numbers are accurate? This does not sound slow to me...

What is your retention period?

The total number of shards in the cluster is close to 8000 at any given instant. Retention is d-2 days. Each document in the index is approximately 1.8 KB. We opted for 100 shards so that indexing would be comparatively faster. The cluster has an index-heavy load. Please guide us on where exactly the issue might be.

Do you have any non-default settings? Can you confirm that the numbers given are accurate?

If the information is accurate, I would recommend the following:

  • Create the indices with 60 primary shards and 1 replica. Set rollover to cut over at an average shard size of over 50GB. This will reduce the number of shards you are indexing into as well as the number of shards in the cluster (see the sketch after this list).
  • As your nodes have plenty of RAM and you might be experiencing heap pressure (check whether this is the case), place 2 Elasticsearch nodes per host. This means each node will hold one shard per index on average.
  • If you can, make sure each bulk request only indexes into one index. This will result in more documents being indexed per shard per request. As Elasticsearch syncs the transaction log per request, this should improve efficiency.
  • If you have not already, install monitoring so you can see what heap usage looks like.
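
To make the first point concrete, creating such an index via the Java High Level REST Client would look roughly like the sketch below. The index name, write alias and host are placeholders, and the rollover itself (cutting over once the average shard size passes ~50GB) would be driven by the rollover API or an ILM policy on the write alias, which is not shown here.

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.admin.indices.alias.Alias;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.client.indices.CreateIndexRequest;
    import org.elasticsearch.client.indices.CreateIndexResponse;
    import org.elasticsearch.common.settings.Settings;

    public class IndexSettingsSketch {
        public static void main(String[] args) throws Exception {
            try (RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("es-coordinator-1", 9200, "http")))) { // placeholder host

                // First index in the series, behind a write alias so that rollover can cut over later.
                CreateIndexRequest request = new CreateIndexRequest("my-index-000001") // placeholder name
                        .settings(Settings.builder()
                                .put("index.number_of_shards", 60)   // 60 primaries instead of 100
                                .put("index.number_of_replicas", 1))
                        .alias(new Alias("my-index-write").writeIndex(true));

                CreateIndexResponse response = client.indices().create(request, RequestOptions.DEFAULT);
                System.out.println("acknowledged: " + response.isAcknowledged());

                // Rollover at the desired shard size would then be triggered against
                // the "my-index-write" alias via the rollover API or an ILM policy.
            }
        }
    }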
