Cluster hung for 1 hour: no logs, no throughput

I run 12 ES nodes (3 master+data, 9 data-only).
A Java app indexes documents into ES through the TransportClient's bulk API, roughly as sketched below.
Suddenly, indexing throughput dropped to zero for about an hour.
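
To be concrete, here is a rough sketch of the kind of indexing the app does (the cluster name, host, index name, and document are all made up):

import java.net.InetAddress;
import java.util.Collections;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "my-cluster") // placeholder cluster name
                .build();

        // 5.x TransportClient; "es-node-1" is a placeholder host.
        try (TransportClient client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("es-node-1"), 9300))) {

            BulkRequestBuilder bulk = client.prepareBulk();
            bulk.add(client.prepareIndex("logs-2017.06.12", "doc")
                    .setSource(Collections.singletonMap("message", "hello")));

            // get() blocks until the bulk request completes; by default there is
            // no client-side timeout, which matters for the hang described below.
            BulkResponse response = bulk.get();
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
        }
    }
}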

At that time:

  1. One of the data nodes appeared to be hanging, possibly because of a full GC.
  2. The app kept sending documents via the bulk API.
  3. There was no indexing throughput anywhere on the cluster.
  4. No bulk failure responses came back.
  5. Nothing in the app log or the ES log (level INFO).
  6. Server CPU, memory, disk, and network were all normal.
  7. All timeout settings were at their defaults (see the sketch right after this list).
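
Because everything is on defaults, a synchronous bulk call like the one above can block for a very long time. A sketch of what explicit timeouts would look like (the 30-second values are arbitrary):

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

public class TimedBulk {
    // 'client' would be the TransportClient from the sketch above.
    static BulkResponse indexWithTimeouts(Client client, BulkRequestBuilder bulk) {
        // Server-side limit on how long ES may wait for shards to become available.
        bulk.setTimeout(TimeValue.timeValueSeconds(30));
        // Client-side limit: stop blocking after 30 seconds instead of forever, so
        // a stuck node shows up as an ElasticsearchTimeoutException in the app log.
        return bulk.execute().actionGet(TimeValue.timeValueSeconds(30));
    }
}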

In my opinion, a node can hang for many reasons, but the other nodes should keep working, and the cluster should still be able to ingest the bulk documents. A node should also recover once the GC finishes, but this one didn't.
The cluster did nothing for an hour, and I lost my logs for that period.
It only started working again once I killed that node.
I'd like to understand what happened in my cluster.

Which Elasticsearch version are you using? Do you have X-Pack monitoring installed? Is your indexing load evenly distributed across all data nodes? What is the specification of the cluster nodes? How much data/indices/shards do you have in the cluster?

Dear Christian.

I'm using ES 5.2.2 and, sadly, no X-Pack.
Data distribution is not the problem.

System information
CPU:
Intel(R) Xeon(R) CPU E5-2699 v3 (18 cores)
Intel(R) Xeon(R) CPU E5-2697 v4 (18 cores)

MEM:
128 GB (ES runs with -Xmx60g)

DISK:
5 x 1 TB HDDs per data node

Index information
Index count: 115
Shards per index: 40
Size per index: 93 GB
Docs per index: 460 million
Replicas: 1

Without any monitoring data it will be difficult to track down what happened. One thing that stands out though is that you have a heap size that is much larger than what we recommend. What are your heap and GC config for these nodes?
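
For reference, what we generally recommend is giving the heap no more than 50% of the machine's RAM and keeping it below the compressed-oops cutoff of just under 32 GB, so something along these lines (30g is only an example figure):

-Xms30g
-Xmx30g

Elasticsearch logs at startup whether compressed ordinary object pointers are in use, which is an easy way to verify the setting.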

I discovered one additional detail about the moment the node got stuck:
a shard on that node had been initializing just before it hung.
Could this be a contributing factor?
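
For reference, this is roughly how the same TransportClient used for indexing can be asked about initializing shards (a sketch):

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.Client;

public class HealthCheck {
    // 'client' is the same TransportClient used for indexing.
    static void printShardState(Client client) {
        ClusterHealthResponse health =
                client.admin().cluster().prepareHealth().get();
        System.out.println("status=" + health.getStatus()
                + " initializing=" + health.getInitializingShards()
                + " relocating=" + health.getRelocatingShards()
                + " unassigned=" + health.getUnassignedShards());
    }
}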

JVM config
-Xms60g
-Xmx60g
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+DisableExplicitGC
-XX:+AlwaysPreTouch
-Xss1m
-XX:+HeapDumpOnOutOfMemoryError

No other options are configured.

I regret that I didn't install X-Pack on my servers. :disappointed_relieved:

If there is nothing in the logs, having an issue affecting the cluster that long seems a bit odd to me. Are you monitoring the network and/or other server metrics?

Yes, I'm using a monitoring app that I wrote myself. It is installed on every ES server and collects system and process information (one instance also connects to the ES master and collects cluster stats and status), roughly along the lines of the sketch below.
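
The collection works roughly like this (a much simplified sketch; the real app also stores the data elsewhere):

import org.elasticsearch.action.admin.cluster.node.stats.NodeStats;
import org.elasticsearch.action.admin.cluster.node.stats.NodesStatsResponse;
import org.elasticsearch.client.Client;

public class StatsCollector {
    // 'client' is a TransportClient connected to the ES master.
    static void collect(Client client) {
        NodesStatsResponse stats = client.admin().cluster().prepareNodesStats()
                .setJvm(true)        // heap usage and GC counters
                .setThreadPool(true) // e.g. search thread pool activity
                .get();
        for (NodeStats node : stats.getNodes()) {
            System.out.println(node.getNode().getName()
                    + " heap used: " + node.getJvm().getMem().getHeapUsed());
        }
    }
}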

Searching through the problem's history over the weekend, I found the following. At that time (11:49 AM):

  1. The log-collecting app was inserting into the current index.
  2. Someone sent a search request to ES.
  3. The search thread pool grew from 3 to 13 threads.
  4. ES heap usage rose from 46 GB to 50 GB.
  5. There were no GC log entries.
  6. Throughput dropped to 0.
  7. ES used less CPU, memory, and network than before, and the server's system metrics showed nothing unusual.

Around 12:50 PM, I felt something was wrong.
When I opened Cerebro, I saw that the current index had been stuck with a shard initializing since around 11:49 AM (although the index had initialized successfully when it was created).
Then I checked my app and ES logs: my app showed no throughput, and one of the ES nodes had logged a single timeout.
When I killed and restarted that ES node, data started being ingested again.
I find this very strange, which is why I wrote this post.
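
Next time I will check the stuck index's recovery state directly from the client instead of only looking at Cerebro, along these lines (a sketch; the index name is an example):

import java.util.List;
import java.util.Map;

import org.elasticsearch.action.admin.indices.recovery.RecoveryResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.indices.recovery.RecoveryState;

public class RecoveryCheck {
    // Prints the per-shard recovery stage for one index; 'client' as before.
    static void printRecoveries(Client client) {
        RecoveryResponse response = client.admin().indices()
                .prepareRecoveries("logs-2017.06.12") // example index name
                .setActiveOnly(true)                  // only in-flight recoveries
                .get();
        for (Map.Entry<String, List<RecoveryState>> entry :
                response.shardRecoveryStates().entrySet()) {
            for (RecoveryState state : entry.getValue()) {
                System.out.println(entry.getKey() + "[" + state.getShardId().id()
                        + "] stage=" + state.getStage());
            }
        }
    }
}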
