Elasticsearch 6.5.4 - Kibana Coord Nodes OutOfMemory Errors And General Cluster Improvement Questions

I have tried my best to solve these issues myself and have learned a lot in trying to do so, but I'm having trouble knowing what to tackle first and the best way to go about it. I believe the cluster setup below is not optimal, both in terms of ES cluster architecture and index configuration.

Infrastructure Details (all hosted on VMware VMs):
4 Hot Data/Master Nodes:

  • 6 vCPU
  • 16GB RAM (50% for HEAP)
  • SSD storage (1TB each)
  • JVM version: "1.8.0_161"
  • GC mode: CMS

7 Cold Data Nodes:

  • 1 vCPU
  • 8GB RAM (50% for HEAP)
  • HDD storage (1TB each)
  • JVM version: "1.8.0_161"
  • GC mode: CMS

2 Kibana Nodes (each with 1 dedicated ES coordinating-only instance):

  • 4 vCPU
  • 8GB RAM (75% for HEAP)
  • JVM version: "1.8.0_191"
  • GC mode: G1GC

Cluster Stats:

  • Documents: ~6.5 billion
  • Shards: ~7,500
  • Average Shards per Index: ~5
  • Index count: 1,522 (past 9 months of data with daily indices @ ~10GB primaries)
  • Query cache mem size: 2.2GB
  • Segments: 59,941
  • Segments memory: 9.8GB
  • Average event rate: ~500 e/s

Obviously, there are too many shards and segments in the cluster. Unfortunately this cluster is still largely stuck on the defaults (5 shards, 1 replica). I have recently set up Curator to shrink everything down to 1 shard and 1 replica and to merge segments down to 1 per shard. So that's being taken care of, albeit slowly.
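For anyone following along, Curator drives this through the same APIs you can call by hand. A rough sketch of the underlying calls, with made-up index and node names:

```
# placeholder names for an actual daily index and a hot data node
# 1) make the source read-only and gather a full copy of every shard on one node
PUT metricbeat-6.5.4-2019.01.01/_settings
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "hot-data-01"
}

# 2) shrink to a single primary shard, clearing the temporary settings on the target
POST metricbeat-6.5.4-2019.01.01/_shrink/metricbeat-6.5.4-2019.01.01-shrink
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}

# 3) once the shrunken index is green, merge it down to 1 segment per shard
POST metricbeat-6.5.4-2019.01.01-shrink/_forcemerge?max_num_segments=1
```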

I've also found other issues with our indices. A lot of them did not have the metricbeat-*, filebeat-*, and packetbeat-* index templates applied, so many indices were created with every field mapped as both text and keyword. I'm thinking that's the reason for what I assume to be high segments memory (9.8GB), as some high-cardinality fields ended up mapped as keyword. I'm trying to reindex to bring that down, as I suspect it's the underlying reason for the crashes.
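The plan per index is roughly this (index and template names are placeholders, and it assumes the proper Beats template has been put back first so the new index picks up the correct mappings):

```
# confirm the correct template is back in place before reindexing
GET _template/metricbeat-6.5.4

# copy a badly mapped index into a new one that still matches the template's pattern
POST _reindex?wait_for_completion=false
{
  "source": { "index": "metricbeat-6.5.4-2018.12.01" },
  "dest":   { "index": "metricbeat-6.5.4-2018.12.01-fixed" }
}

# and keep an eye on it through the tasks API
GET _tasks?detailed=true&actions=*reindex
```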

In an effort to fix the OutOfMemory issues, I switched the two Kibana es-coord nodes to G1GC. I was not seeing the typical sawtooth pattern prior to doing so, but am seeing it now under low load (pictured briefly below).
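For reference, the change was just swapping the CMS lines in jvm.options for G1 ones, roughly like this (a sketch, not the full stock file; exact flags may differ from what ships in your jvm.options):

```
## jvm.options on the coordinating-only ES instances

## CMS flags commented out:
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly

## G1 enabled instead; IHOP lowered so concurrent marking starts earlier
## than the JVM default of 45%
-XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=30
```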

To the actual issue:
When loading dashboards in Kibana (sometimes, not always) the elasticsearch coordination instances will spike JVM memory and get stuck in a long GC collection cycle before ultimately running out of HEAP space and exiting.

Here are three examples from today where I intentionally induced an OutOfMemory Error with the dashboards so I could go through the HEAP dumps:

I loaded up the heap dump in Eclipse's MAT, and very quickly realized I had no idea what I was doing. All I could figure out was that the char[] and java.lang.String objects seemed to take up the most shallow heap, and that the org.elasticsearch.transport.Transport$ResponseHandlers objects were the largest in HEAP at 2.9GB at the time of the dump. I'm not sure how to use that information to improve the cluster or fix the issues.

I know that ES 7.x has better circuit breakers to prevent these events than ES 6.x, so I've been looking to upgrade, but definitely wanted to resolve some of these other issues first.

I apologize for the information dump and the lack of a specific question, but any advice or suggestions for how to improve our current cluster setup would be very much appreciated.

Thanks in advance,
James S.

I agree that you seem to have too many shards given the size of your data, so it is good to see that you are doing something about it. To address heap usage I would also recommend this webinar, although this would primarily address heap issues on the data nodes. I also think the cold data nodes look undersized and could benefit from another CPU core. Have you monitored CPU usage on these nodes when you are querying? If they are the bottleneck, could this perhaps lead to the coordinating nodes needing to keep data in memory longer than necessary?

Given that it is the dedicated coordinating nodes that are having issues with heap, I would also recommend looking at your dashboards and queries to check whether any of them use unnecessary amounts of RAM, e.g. by creating a very large number of buckets. I doubt changing GC will have much effect, so you may simply need to add more memory and heap to get around this.
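To make the bucket point concrete, a panel that nests a terms split under a fine-grained date histogram multiplies out quickly. Something along these lines (field names and intervals are only examples):

```
# a 24h window at 30s intervals is 2,880 time buckets; split across 50 hosts
# that is roughly 144,000 buckets the coordinating node has to collect and reduce
GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-24h" } }
  },
  "aggs": {
    "per_host": {
      "terms": { "field": "host.name", "size": 50 },
      "aggs": {
        "over_time": {
          "date_histogram": { "field": "@timestamp", "interval": "30s" }
        }
      }
    }
  }
}
```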

Thank you, Christian; I did end up watching the webinar, which I found very informative, but as you said it didn't relate to the coordinating nodes as much.

I have not monitored CPU usage on the cold nodes during querying yet. I hadn't considered doing so, as I can cause these HEAP issues on the coordinating nodes while querying only indices stored on the hot data nodes. Normally, I can use the default [Metricbeat System] Host Overview Dashboard with a window of less than 24 hours to cause the crash. My understanding was that while the shards needed to respond to the query, they didn't need to perform a lot of CPU-intensive work, as the time filter should have eliminated them entirely (but perhaps Elasticsearch isn't built that way). I will try to capture some relevant data from these nodes during querying though.
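One thing I plan to check while doing that: as far as I can tell, 6.x does have a pre-filter phase that can skip shards whose @timestamp range cannot match, and the `skipped` count under `_shards` in the response should show whether that is actually happening. Something like this (the threshold is lowered here only to make the behaviour visible; by default pre-filtering kicks in above 128 shards):

```
# _shards.skipped in the response shows how many shards were ruled out by the
# time range without being searched
GET metricbeat-*/_search?pre_filter_shard_size=1
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-24h" } }
  }
}
```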

I couldn't think of a methodical way to check which dashboards/visualizations may be creating a lot of buckets, as some of them use the Visual Builder style which doesn't have an inspect button (from the same [Metricbeat System] dashboards), so I tried to limit it in Kibana by setting `metrics:max_buckets` from the default of 2000 down to 100. However, I suspect that doesn't actually take the load off the Elasticsearch coordinating nodes, as it still caused them to crash.

The only other thing I've noticed during this is that while loading the visualizations, a large number of long-running tasks show up under _cat/tasks with this action: indices:data/read/field_caps[index].

They were all taking more than 2 minutes. After realizing that that's likely related to the Field Capabilities API, I guess this ties back to having too many indices/shards/segments again?
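For reference, this is roughly how I was watching them (the index pattern is just whichever one the dashboards use):

```
# long-running field_caps calls show up here while a dashboard is loading
GET _tasks?detailed=true&actions=indices:data/read/field_caps*

# the same kind of call Kibana makes for an index pattern; with ~1,500 indices
# it fans out to every index behind the pattern
GET metricbeat-*/_field_caps?fields=*
```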

So I did end up doing some performance monitoring across the "warm" nodes (the cold data nodes listed above). I used the Infrastructure App, as I knew that with our setup it could take a long time to load, and it didn't run the risk of hitting an intensive visualization of our own making.

As you said @Christian_Dahlqvist, it appears the warm nodes are underpowered, as searches/queries executed on them could take upwards of 10 minutes even when only trying to load a one-hour window in the app. Using the /_tasks API I was able to extract the queries being run by the Infrastructure App and look at them in the Dev Tools search profiler. Most of the shards ran quickly, but others could take 20s+ trying to load the global ordinals (GlobalOrdinalsStringTermsAggregator).
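In case it helps anyone hitting the same thing, the workflow was roughly this (the aggregation below is a stand-in, not the actual Infrastructure App query):

```
# grab the in-flight searches (and how long they have been running) while the app loads
GET _tasks?detailed=true&actions=*search*

# then re-run the extracted body with profiling; time spent building global
# ordinals shows up under the aggregations section of the profile output
GET metricbeat-*/_search
{
  "profile": true,
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-1h" } } },
  "aggs": {
    "hosts": { "terms": { "field": "host.name", "size": 10 } }
  }
}
```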

Our cluster doesn't see a lot of indexing load (500 events/sec with rare peaks of 2k events/sec). I'm tempted to turn on eager_global_ordinals for certain fields in the indices to shift the load from query time to index time, but I don't think that will help as much as increasing resources for the machines.
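If I do go that route, the change itself is small; roughly this against a live index (the field is just an example of one the dashboards aggregate on, and the same block would also need to go into the index template so new daily indices pick it up):

```
# 6.x mapping update; "doc" is the type our beats indices use, and host.name is
# only an example field
PUT metricbeat-6.5.4-2019.01.01/_mapping/doc
{
  "properties": {
    "host": {
      "properties": {
        "name": {
          "type": "keyword",
          "eager_global_ordinals": true
        }
      }
    }
  }
}
```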

I did increase the Kibana request timeout from the default of 30000 to 120000 (ms), and that seems to prevent the coordinating nodes from suffering OOM events, although RAM/heap still gets maxed out.
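For anyone else, the setting lives in kibana.yml; assuming I have the name right it is elasticsearch.requestTimeout:

```
# kibana.yml on both Kibana nodes; value is in milliseconds (default 30000)
elasticsearch.requestTimeout: 120000
```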

A few other things I've found that I'm wondering if they could alleviate these issues further:

  • Disk:RAM ratios
  • Creating dedicated masters instead of hot data + master
  • Index sorting by timestamp as all our data is time series (see the sketch below)
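For the index sorting idea, the settings have to be applied at index creation, so they would go into a template that merges with the Beats templates when each new daily index is created (the template name here is a placeholder; this assumes the Beats template that maps @timestamp also applies at creation time, since the sort field must be mapped up front):

```
# index sorting cannot be added to an existing index, only set in a template
PUT _template/timeseries-sort
{
  "index_patterns": ["metricbeat-*", "filebeat-*", "packetbeat-*"],
  "order": 10,
  "settings": {
    "index.sort.field": "@timestamp",
    "index.sort.order": "desc"
  }
}
```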

If you have issues with global ordinals, forcemerge indices that are no longer written to down to 1 segment. Dedicated master nodes are generally recommended.
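For reference, on 6.x a dedicated master is just a matter of node roles in elasticsearch.yml, along these lines (a sketch for a set of three small master-eligible nodes):

```
# elasticsearch.yml on each dedicated master-eligible node (6.x)
node.master: true
node.data: false
node.ingest: false

# with three master-eligible nodes, keep zen discovery's quorum at 2
discovery.zen.minimum_master_nodes: 2
```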
