Hi All,
I've got an issue with ElasticSearch, or more specifically, I suspect, with the JVM heap it's using.
Before I go into detail overload, let me set out what I have.
The Setup
4 x nodes (3 x master-eligible & data nodes, 1 x client node).
Focusing on the master/data nodes, they are running CentOS Linux release 7.2.1511 (Core), with java-1.8.0-openjdk-headless-1.8.0.91-1.b14.el7_2.x86_64 and ElasticSearch 2.3.4.
Each has a single core, 4 GB of memory and 500 GB of storage (a partition dedicated to the ES data).
They are correctly configured to mlockall, the OS swappiness is set to 1, and the heap min and max are both set to 2 GB.
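For reference, these are the relevant settings as applied; a minimal excerpt assuming the stock RPM layout, so the exact file paths may differ on other installs:
# /etc/sysconfig/elasticsearch
ES_HEAP_SIZE=2g
# /etc/elasticsearch/elasticsearch.yml
bootstrap.mlockall: true
# /etc/sysctl.conf
vm.swappiness = 1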
On initial boot they seem to be working fine, and for a little while they do.
The cluster shows healthy, and continues to do so throughout.
Logging is set to INFO rather than DEBUG, to prevent it from swallowing the storage on the root partition.
The Issue
The first symptom I notice is Kibana reporting plugin:elasticsearch "Request Timeout after 30000ms".
I then see slow responses from any API that needs to poll the cluster members, and I believe it may be stopping ElasticSearch from working correctly as a cluster.
The issue is occurring right now: if I run a /_cat API call directly against the client node, it takes quite some time to respond.
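For reference, the call is along these lines, against the client node (assuming the default HTTP port):
curl 'http://10.0.0.7:9200/_cat/nodes?v'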
When it eventually responds, it provides the following output - all good so far as I can tell:
host     ip       heap.percent ram.percent load node.role master name
10.0.0.7 10.0.0.7            7          69 0.07 -         -      NODE7
10.0.0.2 10.0.0.2           91          96 1.22 d         *      NODE2
10.0.0.1 10.0.0.1           91          91 0.61 d         m      NODE1
10.0.0.3 10.0.0.3           99          94 1.57 d         m      NODE3
Tailed logs from NODE1:
[2016-08-06 03:46:09,951][WARN ][monitor.jvm ] [NODE1] [gc][young][48296][58732] duration [1.9s], collections [1]/[2.4s], total [1.9s]/[33.1m], memory [1.3gb]->[1.3gb]/[1.9gb], all_pools {[young] [18mb]->[4.8mb]/[66.5mb]}{[survivor] [4.7mb]->[7.9mb]/[8.3mb]}{[old] [1.3gb]->[1.3gb]/[1.9gb]}
Tailed logs from NODE2:
[2016-08-11 09:17:35,415][WARN ][transport ] [NODE2] Received response for a request that has timed out, sent [81266ms] ago, timed out [66266ms] ago, action [cluster:monitor/nodes/stats[n]], node [{NODE3}{f29G8IbhSYevQcDPrDUqyg}{10.0.0.3}{10.0.0.3:9300}{max_local_storage_nodes=1}], id [4506258]
[2016-08-11 09:17:35,415][WARN ][transport ] [NODE2] Received response for a request that has timed out, sent [141270ms] ago, timed out [126267ms] ago, action [cluster:monitor/nodes/stats[n]], node [{NODE3}{f29G8IbhSYevQcDPrDUqyg}{10.0.0.3}{10.0.0.3:9300}{max_local_storage_nodes=1}], id [4506107]
Tailed logs from NODE3:
[2016-08-11 09:12:34,800][INFO ][monitor.jvm ] [NODE3] [gc][old][370335][68699] duration [6.4s], collections [1]/[6.6s], total [6.4s]/[1.7d], memory [1.9gb]->[1.9gb]/[1.9gb], all_pools {[young] [59.3mb]->[56.2mb]/[66.5mb]}{[survivor] [0b]->[0b]/[8.3mb]}{[old] [1.9gb]->[1.9gb]/[1.9gb]}
[2016-08-11 09:12:46,014][INFO ][monitor.jvm ] [NODE3] [gc][old][370337][68701] duration [6.6s], collections [1]/[6.7s], total [6.6s]/[1.7d], memory [1.9gb]->[1.9gb]/[1.9gb], all_pools {[young] [59mb]->[62.2mb]/[66.5mb]}{[survivor] [0b]->[0b]/[8.3mb]}{[old] [1.9gb]->[1.9gb]/[1.9gb]}
The first thing I notice from the log snippets above is that NODE2 (the current master) is complaining about NODE3.
Looking at NODE3, something doesn't look right with its garbage collection: the old generation is full ([1.9gb]->[1.9gb]/[1.9gb]) and repeated old-gen collections are reclaiming nothing. My knowledge of the JVM is very limited and I'm trying to learn enough to support it properly for ES.
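For anyone wanting more detail than the GC log lines, I assume the nodes-stats API is the right place to look; something like this via the client node should show the live heap/GC figures for NODE3:
curl 'http://10.0.0.7:9200/_nodes/NODE3/stats/jvm?pretty'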
So... Questions at this stage:
- What triggers garbage collection?
- Why only NODE3? All indices are shard balanced over the three nodes. Balancing is enabled.
- Why was NODE1's GC logged at WARN, but NODE3's (which looks worse) only at INFO?
- What can I do to prevent this? Is this a sign that I need to scale up/out?
- What should I be monitoring to predict this before we get to this stage? (Rough idea below this list.)
- Am I even looking at the right thing?
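On the monitoring question, is polling something like this for heap.percent a sensible starting point? (Column names are just how I read the _cat/nodes docs.)
curl -s 'http://10.0.0.7:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent,load'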
If I were to restart NODE3, I suspect it would restore service, but only temporarily: the last time this occurred, NODE1 was the master and NODE2 was the node with the issue.
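For completeness, by "restart" I just mean the stock service (assuming the default unit name from the RPM install):
sudo systemctl restart elasticsearch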
I've already looked through the following threads for answers, which at least helped me frame my questions:
Any advice or answers to my questions would be massively appreciated.
Best regards,
jdmac