Elasticsearch node goes down

I have a cluster with 5 master nodes, 12 coordinator nodes and 60 data nodes. I am currently doing heavy indexing into this ES cluster, around 15 billion documents spread through the day. We have 3 indices undergoing heavy indexing, with four rollovers per day for each index. Each index has 100 shards and the replica count is set to 1. The nodes run on physical servers with 200 GB of RAM, each node has around 32 GB of heap, and translog durability is set to async.

Bulk indexing via the bulk processor is happening very slowly with 32 clients; the batch size of the bulk processor is 7500, the bulk actions setting is 20, and the bulk size is 25 MB. All the network interfaces have 10 Gb bandwidth.
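
For reference, the bulk processor is set up roughly like the sketch below (the host name is a placeholder, and the exact mapping of the numbers above onto the individual BulkProcessor settings is approximate):

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.bulk.BulkProcessor;
    import org.elasticsearch.action.bulk.BulkRequest;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.common.unit.ByteSizeUnit;
    import org.elasticsearch.common.unit.ByteSizeValue;

    public class BulkProcessorSketch {

        static BulkProcessor build(RestHighLevelClient client) {
            BulkProcessor.Listener listener = new BulkProcessor.Listener() {
                @Override
                public void beforeBulk(long executionId, BulkRequest request) {
                    // called before each bulk request is sent
                }

                @Override
                public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                    // check response.hasFailures() and log rejected items here
                }

                @Override
                public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                    // transport-level failures (e.g. node disconnected) end up here
                }
            };

            return BulkProcessor.builder(
                    (request, bulkListener) -> client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                    listener)
                .setBulkActions(7500)                                // flush after this many actions
                .setBulkSize(new ByteSizeValue(25, ByteSizeUnit.MB)) // or after this much payload
                .setConcurrentRequests(20)                           // in-flight bulks; add() blocks when all are busy
                .build();
        }

        public static void main(String[] args) {
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("es-coordinator-1", 9200, "http"))); // placeholder host
            BulkProcessor processor = build(client);
            // processor.add(new IndexRequest("my-index").source(jsonMap)); // documents are added here
        }
    }

When all concurrent requests are in flight, add() blocks the calling thread, which would be consistent with the "threads blocked in add" symptom listed below.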

  • But the bulk indexing is happening very slowly.

  • During indexing and searching, the following errors are occurring:

    • We have seen that many of the threads pushing data are getting blocked in the bulk processor's add() call.
    • The cluster is not responding properly, and the bulk indexing through the bulk processor is also showing up in the slow log.
    • I am getting node disconnected and ReceiveTimeoutTransportException errors when I do bulk indexing or execute other queries.
  • In the logs I am also seeing "failed to execute query phase (No search context found for id)", GC errors, and nodes being removed and added.

Please suggest. This is a critical issue we are facing.


Why are you indexing into so many shards? Is that 100 primary or 100 primary and replica shards? What is the average shard size?

I have 100 primary shards and the replica count is set to 1. The average shard size is around 30 to 40 GB.

If you have 100 primary shards per index and the average shard size is 30GB you are generating 72TB (30GB * 100 primary shards * 2 (1 replica) * 3 indices * 4 rollovers) of data on disk per day. That is 2400 shards generated per day. To me this sounds a bit strange. Are you sure those numbers are accurate? This does not sound slow to me...

What is your retention period?

The total number of shards in the cluster is close to 8000 at any given instant. Retention is d-2 days. Each document in the index is approximately 1.8 KB. We opted for 100 shards so that indexing would be comparatively faster. The cluster has an index-heavy load. Please guide us on where exactly the issue might be.

Do you have any non-default settings? Can you confirm that the numbers given are accurate?

If the information is accurate, I would recommend the following:

  • Create the indices with 60 primary shards and 1 replica. Set rollover to cut over at an average shard size of over 50GB. This will reduce the number of shards you are indexing into as well as the number of shards in the cluster (see the sketch after this list).
  • As your nodes have plenty of RAM and you might be experiencing heap pressure (check whether this is the case), place 2 Elasticsearch nodes per host. This means each node will hold one shard per index on average.
  • If you can, make sure each bulk request only indexes into one index. This will result in more documents being indexed per shard per request. As Elasticsearch syncs the transaction log per request, this should improve efficiency.
  • If you have not already, install monitoring so you can see what heap usage looks like.
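
To make the first point concrete, creating such an index via the Java High Level REST Client would look roughly like the sketch below. The index name, write alias and host are placeholders, and the rollover itself (cutting over once the average shard size passes ~50GB) would be driven by the rollover API or an ILM policy on the write alias, which is not shown here.

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.admin.indices.alias.Alias;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.client.indices.CreateIndexRequest;
    import org.elasticsearch.client.indices.CreateIndexResponse;
    import org.elasticsearch.common.settings.Settings;

    public class IndexSettingsSketch {
        public static void main(String[] args) throws Exception {
            try (RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("es-coordinator-1", 9200, "http")))) { // placeholder host

                // First index in the series, behind a write alias so that rollover can cut over later.
                CreateIndexRequest request = new CreateIndexRequest("my-index-000001") // placeholder name
                        .settings(Settings.builder()
                                .put("index.number_of_shards", 60)   // 60 primaries instead of 100
                                .put("index.number_of_replicas", 1))
                        .alias(new Alias("my-index-write").writeIndex(true));

                CreateIndexResponse response = client.indices().create(request, RequestOptions.DEFAULT);
                System.out.println("acknowledged: " + response.isAcknowledged());

                // Rollover at the desired shard size would then be triggered against
                // the "my-index-write" alias via the rollover API or an ILM policy.
            }
        }
    }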
