Cluster hangs for 1h. no logs, no throughput

ghs · June 23, 2017, 5:26am

I run 12 ES nodes (3 master nodes that also hold data, 9 data-only nodes).
A Java app indexes documents into ES using the TransportClient bulk API.
Suddenly, ES throughput dropped to zero for 1 hour.

At that time:

One of the data nodes was hanging, maybe because of a full GC.
The app was sending documents using the bulk API.
There was no throughput on the whole cluster.
There were no bulk failure responses.
There was nothing in the app log or the ES log (level INFO).
Server CPU, memory, disk, and network were normal.
Timeout configuration is default.

IMO, a node can hang for many reasons, but the other nodes should keep working and the cluster should still ingest the bulk documents. The node should also recover once GC finishes, but it didn't.
The cluster didn't work for 1 hour, and I lost my logs.
The cluster started working again when I killed that node.
I'd like to know what happened in my cluster.
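
Editor's note: the post describes bulk calls that neither failed nor returned while timeouts were left at their defaults. One way an app can at least notice such a stall is to bound each blocking call with its own deadline. The following is a minimal sketch using only the Java standard library; `bulkCall` is a hypothetical stand-in for whatever code performs the TransportClient bulk request, and the class and method names are illustrative, not part of any Elasticsearch API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedBulkSend {
    enum Outcome { OK, FAILED, TIMED_OUT }

    // Runs the (hypothetical) bulk call on a worker thread and waits at most timeoutMillis.
    static Outcome sendWithDeadline(Callable<Boolean> bulkCall, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<Boolean> pending = worker.submit(bulkCall);
        try {
            return pending.get(timeoutMillis, TimeUnit.MILLISECONDS) ? Outcome.OK : Outcome.FAILED;
        } catch (TimeoutException e) {
            pending.cancel(true);            // give up on the stalled call
            return Outcome.TIMED_OUT;        // caller can log, alert, or retry elsewhere
        } catch (ExecutionException e) {
            return Outcome.FAILED;           // the call itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulated stalled call: sleeps far longer than the deadline.
        Callable<Boolean> stuck = () -> { Thread.sleep(5_000); return true; };
        System.out.println(sendWithDeadline(stuck, 200));
        // Simulated healthy call: returns immediately.
        System.out.println(sendWithDeadline(() -> true, 200));
    }
}
```

A guard like this does not fix a hung node, but a TIMED_OUT outcome written to the app log would have left a trace of when the stall began.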

Christian_Dahlqvist · June 23, 2017, 5:46am

Which Elasticsearch version are you using? Do you have X-Pack monitoring installed? Is your indexing load evenly distributed across all data nodes? What is the specification of the cluster nodes? How much data/indices/shards do you have in the cluster?

ghs · June 23, 2017, 6:02am

Dear Christian,

I'm using ES 5.2.2 and, sadly, not using X-Pack.
Data distribution is not a problem.

System information
CPU :
Intel(R) Xeon(R) CPU E5-2699 v3 (18core)
Intel(R) Xeon(R) CPU E5-2697 v4 (18core)

MEM :
128GB (ES using -Xmx60g)

DISK :
1 data node has 5 HDDs (1TB)

Index information
Index count : 115
Shards per index : 40
Size per index : 93GB
Docs per index : 460 million
Replicas : 1
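
Editor's note: the figures above imply a large shard count, which is easier to see when worked out. The sketch below assumes that "40" counts primary shards only and that all 12 nodes (3 master with data, 9 data only) hold shards; both are readings of the post, not confirmed by the poster. The class name is illustrative.

```java
final class ShardCount {
    // Total shard copies = indices * primaries per index * (1 primary + replicas).
    static long totalShards(int indices, int primariesPerIndex, int replicas) {
        return (long) indices * primariesPerIndex * (1 + replicas);
    }

    // Average number of shard copies each data-holding node carries.
    static double shardsPerNode(long totalShards, int dataNodes) {
        return (double) totalShards / dataNodes;
    }

    public static void main(String[] args) {
        long total = totalShards(115, 40, 1);
        System.out.println(total + " shards in the cluster");                       // 9200
        System.out.printf("%.0f shards per node%n", shardsPerNode(total, 12));      // 767
    }
}
```

Under those assumptions the cluster holds 9200 shard copies, or roughly 767 per node.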

Christian_Dahlqvist · June 23, 2017, 7:09am

Without any monitoring data it will be difficult to track down what happened. One thing that stands out though is that you have a heap size that is much larger than what we recommend. What are your heap and GC config for these nodes?

ghs · June 23, 2017, 7:39am

I found one more thing about the moment the node got stuck.
A shard on the stuck node had been initializing before the node hung.
Could this have contributed to the problem?

JVM config
-Xms60g
-Xmx60g
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+DisableExplicitGC
-XX:+AlwaysPreTouch
-Xss1m
-XX:+HeapDumpOnOutOfMemoryError

No other options are configured.

I regret that I didn't install X-Pack on my servers.
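
Editor's note: the JVM options listed in this post contain no GC logging flags, so a long full GC pause would leave no GC log to confirm or rule it out. As a hedged sketch, assuming the nodes run a Java 8 HotSpot JVM (the thread does not state the Java version), lines like the following could be added to the JVM options. The log path is a hypothetical example.

```
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/elasticsearch/gc.log
```

With these in place, the GC log records each collection with a timestamp and the total time application threads were stopped, which makes a multi-minute pause visible after the fact.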

Christian_Dahlqvist · June 23, 2017, 8:50am

If there is nothing in the logs, having an issue affecting the cluster that long seems a bit odd to me. Are you monitoring the network and/or other server metrics?

ghs · June 26, 2017, 1:34am

Yes, I use a monitoring app that I wrote. It is installed on every ES server and collects system and process information. (One instance connects to the ES master and collects stats and status info.)
I went through the history of the problem over the weekend and found this.

At that time (11:49 am):

The log-collecting app was inserting into the current index.
Someone sent a search request to ES.
Search thread pool usage increased (3 to 13).
ES heap usage increased (46g to 50g).
There was no GC log.
Throughput went to 0.
ES's CPU, memory, and network usage were lower than before.
At that time, the server system info showed nothing special.

I noticed something strange at around 12:50 pm.
I opened cerebro and found that the current index had been stuck on shard initialization since around 11:49 am (even though its shards had initialized successfully when the index was created).
Then I opened my app log and the ES log. There were no throughput entries from my app, and there was one timeout entry in the log of one ES node.
When I killed that ES node and restarted it, data insertion started again.
I found this very strange, so I wrote this post.
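
Editor's note: because the node was killed before any state was captured, little evidence remained. A small snapshot tool run while the cluster is stuck can preserve some. The sketch below uses only the Java standard library and queries REST endpoints that exist in Elasticsearch 5.x (cluster health, pending tasks, thread pools, shards, hot threads). The base URL, class name, and timeout value are assumptions for illustration, and a node that is fully frozen may not answer at all, which is why every request has a timeout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class StuckClusterSnapshot {
    // Diagnostic endpoints worth capturing before restarting a stuck node.
    static final List<String> PATHS = Arrays.asList(
        "/_cluster/health",
        "/_cluster/pending_tasks",
        "/_cat/thread_pool?v",
        "/_cat/shards?v",
        "/_nodes/hot_threads");

    // Joins base URL and path, tolerating a trailing slash on the base.
    static String urlFor(String baseUrl, String path) {
        String base = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
        return base + path;
    }

    // Performs a GET with connect and read timeouts so a hung node cannot block the tool.
    static String fetch(String url, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Assumed default address; pass the real node address as the first argument.
        String base = args.length > 0 ? args[0] : "http://localhost:9200";
        for (String path : PATHS) {
            String url = urlFor(base, path);
            try {
                System.out.println("== " + url + "\n" + fetch(url, 5_000));
            } catch (IOException e) {
                System.out.println("== " + url + " failed: " + e);
            }
        }
    }
}
```

Pointing the tool at a healthy node as well as the suspect one shows which requests time out and which thread pools have queued work.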

system · July 24, 2017, 1:35am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
