I am looking for any tips or advice on how to troubleshoot our Elasticsearch cluster. This cluster has been running flawlessly with only minor maintenance for a couple of years, but we suddenly experienced an outage a week ago today, and I have been struggling to keep it running ever since. (It is actually our dev environment, so not production, but our devs are impacted and we are worried the same thing could happen in production.)
The symptom is this:
We have three client nodes that coordinate requests with the data nodes. As soon as there is any kind of traffic, I see constant Garbage Collection in the logs. Example:
[2019-08-09T00:38:15,835][WARN ][o.e.m.j.JvmGcMonitorService] [client-vm0] [gc][1034] overhead, spent [694ms] collecting in the last [1s]
[2019-08-09T00:38:17,066][WARN ][o.e.m.j.JvmGcMonitorService] [client-vm0] [gc][1035] overhead, spent [693ms] collecting in the last [1.2s]
[2019-08-09T00:38:18,079][INFO ][o.e.m.j.JvmGcMonitorService] [client-vm0] [gc][1036] overhead, spent [352ms] collecting in the last [1s]
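(For reference, heap pressure and GC counts on a client can be spot-checked with roughly the following; client-vm0 is one of our clients, and the default HTTP port 9200 is an assumption here:)
# Rough check of heap usage and GC activity on one client node
curl -s 'http://client-vm0:9200/_nodes/client-vm0/stats/jvm?pretty' \
  | grep -E 'heap_used_percent|collection_count|collection_time_in_millis'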
At some point the client loses communication with the master:
[2019-08-06T14:38:54,403][INFO ][o.e.d.z.ZenDiscovery ] [client-vm0] master_left [{master-vm1}{PZJChTgxT46h4YYOqMr2fg}{1G3fXiSMQ5auVH-i5RH10w}{10.0.0.11}{10.0.0.11:9300}{ml.machine_memory=30064300032, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2019-08-06T14:38:54,419][WARN ][o.e.d.z.ZenDiscovery ] [client-vm0] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:…
[2019-08-06T14:38:54,434][INFO ][o.e.x.w.WatcherService ] [client-vm0] stopping watch service, reason [no master node]
It tries to find another master, but is unable to:
[2019-08-06T14:41:56,528][WARN ][o.e.d.z.ZenDiscovery ] [client-vm0] not enough master nodes discovered during pinging (found [], but needed [2]), pinging again
During this time, the masters and all the data nodes are perfectly fine. I have usually seen the above when there is a large GC and the master loses contact with the client because the client is too busy with GC to respond. But in this case, it is the client that can't find the master.
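(For context, this is the sort of check that tells me the rest of the cluster is healthy; master-vm1 is one of our masters, and again the default HTTP port is assumed:)
# Ask a master directly who it currently sees as master, and check per-node heap/load
curl -s 'http://master-vm1:9200/_cat/master?v'
curl -s 'http://master-vm1:9200/_cat/nodes?v&h=name,node.role,heap.percent,ram.percent,load_1m'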
Eventually the client suffers an Out Of Memory failure and the JVM crashes. I am assuming that the memory issues, the GC, and the crashing are all related, but I am having trouble figuring out what the cause is and why it started so suddenly.
Cluster details:
Elasticsearch:
  Version: 6.3.2
  License: Open Source
Nodes:
  Client (3) (D13_v2): 8 CPU; 56 GB RAM; HDD drives
  Master (3) (D4_v2): 8 CPU; 28 GB RAM; HDD drives
  Data (35) (DS13): 8 CPU; 56 GB RAM; SSD OS disk & 3x1TB SSD data drives
Indexes:
  Taxonomy:
    Size: 1.45 GB
    Shards: 2
    Replicas: 1
  Support:
    Size: 365 GB
    Shards: 30
    Replicas: 1
  NonSupport:
    Size: 2 TB
    Shards: 80
    Replicas: 1
JVM:
  Java version: 1.8.0_144-b01 (Note: we stayed on this build because it is what had been running for most of the last year and we wanted to start from a known-good state.)
  ES_HEAP_SIZE: 28672m (roughly half of available memory)
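(For completeness, the heap and GC flags the client JVM is actually running with can be confirmed from the nodes info API, something like this; default HTTP port assumed:)
# Show the -Xms/-Xmx and GC options actually passed to the JVM on client-vm0
curl -s 'http://client-vm0:9200/_nodes/client-vm0/jvm?filter_path=**.input_arguments&pretty'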
Note also that we have an ongoing issue with field-mapping explosion that has been growing. This may be a culprit, and we are investigating it.
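(As a rough measure of that explosion, the number of mapped fields in an index can be eyeballed with something like the following; taxonomy is one of our indexes, the default HTTP port is assumed, and the count slightly over-counts because multi-fields each carry their own "type":)
# Rough field count for one index: count "type" keys in its mapping
curl -s 'http://client-vm0:9200/taxonomy/_mapping' | grep -o '"type"' | wc -l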
If I shut off external access to the cluster, then everything is roses. When I open it back up again, the clients go down within minutes. If I clear out all the pending requests that I can see (we have a lot of queue-based traffic), then things seem fine for some measure of time (~12 hours), but then some load level is reached and things fall over again.
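(When traffic is open, the cluster-side build-up can be watched via the thread pool queues and pending cluster tasks, roughly like this; default HTTP port assumed:)
# Watch per-node thread pool queues and any pending cluster-level tasks
curl -s 'http://client-vm0:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'
curl -s 'http://client-vm0:9200/_cluster/pending_tasks?pretty'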
I have tried adjusting the heap size to change GC times. I have also reduced mappings in the taxonomy index, and that seemed to help.
My current questions:
- Are there any immediate things I should be checking right off the bat?
- What should I expect to see in the logs for GC? Is it normal to have 75% of the log filled with GC entries?
- In these scenarios, is it better to lower the heap so that GC pauses are shorter, even if it means more frequent GC?
I also find that I am unable to inspect the .hprof dumps from the crashes because they are just too huge (~45 GB), so I can't get any information from them about why the JVM might be crashing.
I set up DataDog when I started the investigation, but it collects a lot of data that I don't know how to interpret.
Any advice would be appreciated.
Thanks,
~john