I have tried my best to solve these issues myself and have learned a lot in trying to do so, but I'm having trouble knowing what to tackle first and the best way to go about it. I believe the cluster setup below is suboptimal, both in terms of the ES cluster architecture and the index configuration.
Infrastructure Details (all hosted on VMware VMs):
4 Hot Data/Master Nodes:
- 6 vCPU
- 16GB RAM (50% for HEAP)
- SSD storage (1TB each)
- JVM version: "1.8.0_161"
- GC mode: CMS
7 Cold Data Nodes:
- 1 vCPU
- 8GB RAM (50% for HEAP)
- HDD storage (1TB each)
- JVM version: "1.8.0_161"
- GC mode: CMS
2 Kibana nodes, each with 1 dedicated ES coordinating-only instance:
- 4 vCPU
- 8GB RAM (75% for HEAP)
- JVM version: "1.8.0_191"
- GC mode: G1GC
Cluster Stats:
- Documents: ~6.5 billion
- Shards: ~7,500
- Average Shards per Index: ~5
- Index count: 1,522 (past 9 months of data with daily indices @ ~10GB primaries)
- Query cache mem size: 2.2GB
- Segments: 59,941
- Segments memory: 9.8GB
- Average event rate: ~500 e/s
Obviously, there are too many shards and segments in the cluster. Unfortunately, this cluster is still largely stuck on the defaults (5 shards, 1 replica). I have recently set up Curator to shrink everything down to 1 primary shard and 1 replica and force merge down to 1 segment per shard, so that is being taken care of, albeit slowly.
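For anyone curious, the Curator actions look roughly like this (the index pattern and age filter are illustrative placeholders, not my exact values):

```yaml
actions:
  1:
    action: shrink
    description: "Shrink older daily indices down to a single primary shard"
    options:
      shrink_node: DETERMINISTIC
      number_of_shards: 1
      number_of_replicas: 1
      delete_after: True        # drop the original index once the shrink succeeds
    filters:
      - filtertype: pattern
        kind: prefix
        value: filebeat-        # placeholder; repeated per beat in reality
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 7
  2:
    action: forcemerge
    description: "Force merge the shrunken indices down to 1 segment per shard"
    options:
      max_num_segments: 1
      delay: 120                # pause between indices to soften the I/O hit
    filters:
      - filtertype: pattern
        kind: prefix
        value: filebeat-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 7
```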
I've also found other issues with our indices. A lot of them were created without the metricbeat-*, filebeat-*, and packetbeat-* index templates applied, so dynamic mapping gave every string field both a text and a keyword mapping. I'm thinking that's the reason for what I assume to be high segments memory (9.8GB), since some high-cardinality fields ended up mapped as keyword. I'm reindexing those indices to bring that number down, as I suspect it's the underlying cause of the crashes.
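The reindexing itself is nothing fancy; per affected daily index it's basically the following (index names are just examples, and it assumes the destination name still matches the now-installed template's pattern so the correct mappings are applied):

```
# Example only: reindex one badly-mapped daily index into a new index
# that picks up the proper beats template, then drop the old one.
POST _reindex?wait_for_completion=false
{
  "source": { "index": "filebeat-2019.01.15" },
  "dest":   { "index": "filebeat-2019.01.15-reindexed" }
}
```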
In an effort to fix the OutOfMemory issues, I switched the two Kibana es-coord nodes to G1GC. I was not seeing the typical sawtooth pattern prior to doing so, but I am seeing it now under low load (screenshot below).
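For reference, the change was essentially swapping the GC flags in jvm.options on those two nodes, roughly as below (the pause target and occupancy threshold are values I picked, not anything official):

```
## jvm.options on the two coordinating-only nodes

## CMS flags commented out:
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly

## G1GC enabled instead:
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=30
```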
Now to the actual issue:
When loading dashboards in Kibana (sometimes, not always), the coordinating-only Elasticsearch instances spike JVM memory usage and get stuck in a long GC cycle before ultimately running out of heap space and exiting.
Here are three examples from today where I intentionally triggered an OutOfMemoryError with the dashboards so I could go through the heap dumps:
I loaded the heap dump into Eclipse MAT and very quickly realized I had no idea what I was doing. All I could figure out was that `char[]` and `java.lang.String` objects took up the most shallow heap, and that the `org.elasticsearch.transport.Transport$ResponseHandlers` object was the largest in the heap at 2.9GB at the time of the dump. I'm not sure how to use that information to improve the cluster or fix the issues.
I know that ES 7.x has better circuit breakers (the real memory circuit breaker) than ES 6.x to prevent this kind of event, so I've been looking to upgrade, but I definitely wanted to resolve some of these other issues first.
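In the meantime, one stopgap I've been considering (not yet tested, and the percentages below are guesses on my part) is tightening the request and total circuit breakers so that an expensive dashboard query trips a breaker instead of exhausting a coordinating node's heap:

```
PUT _cluster/settings
{
  "transient": {
    "indices.breaker.request.limit": "40%",
    "indices.breaker.total.limit": "60%"
  }
}
```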
I apologize for the information dump and the lack of a specific question, but any advice or suggestions for how to improve our current cluster setup would be very much appreciated.
Thanks in advance,
James S.