I am running a cluster with 20 9TB data nodes (32GB heap), 3 client nodes (24GB heap), and 3 master nodes (20GB heap). This cluster had been running stably under monitoring for the past 90 days. Yesterday, nowhere near a UTC rollover event (which would create new daily indices), we started seeing elevated GC time on the masters. The GCs are more frequent and take longer; here's a graph:
What can I do to debug this issue? In the past it has been a problem with out-of-control schema growth from poorly processed data, but here I'm not sure. How can I track down how memory is being used on the master?
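One way to keep an eye on heap and old-gen GC across the master-eligible nodes is to poll the nodes stats API. A rough sketch of that, assuming the cluster is reachable at http://localhost:9200 with no authentication (adjust the base URL and add auth for your setup):

```python
# Minimal sketch: pull JVM heap and old-gen GC stats for the master-eligible
# nodes so you can see which node is churning and how fast old-gen collections
# are accumulating.
import requests

BASE = "http://localhost:9200"  # assumption: adjust host/port/auth for your cluster

resp = requests.get(f"{BASE}/_nodes/master:true/stats/jvm")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    jvm = node["jvm"]
    old_gc = jvm["gc"]["collectors"]["old"]
    print(
        f"{node['name']}: "
        f"heap {jvm['mem']['heap_used_percent']}% used, "
        f"old-gen GCs {old_gc['collection_count']} "
        f"({old_gc['collection_time_in_millis']} ms total)"
    )
```

Running this on a schedule (or graphing the same fields from your monitoring) makes it easy to see whether the pressure is on the elected master only or on all master-eligible nodes.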
It turns out one of our analysts was logging Bro data directly to the cluster and had turned it on right at the time we saw this issue start. Disabling their data flow got the masters under control again.
We are still trying to figure out what is causing the Bro IDS data to induce master GC pressure.
I've seen this kind of thing when there is a field explosion due to inadvisably structured documents being indexed, i.e. something that should be a field value appears in the documents as a field name, causing a high rate of dynamic field additions to your mapping. A particular tell for this (in addition to looking at the mapping and seeing thousands of fields) is that heap usage is extraordinarily high even on the unelected masters.
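If that's what's happening, a quick way to spot the offending index is to count the mapped fields per index. A rough sketch, again assuming the cluster is reachable at http://localhost:9200 with no authentication; the counter just walks the mapping JSON, so it works whether or not your version still uses mapping types:

```python
# Minimal sketch: count the number of mapped fields per index by walking each
# index's mapping and tallying every entry under "properties" / "fields".
# A runaway index will stand out by orders of magnitude.
import requests

BASE = "http://localhost:9200"  # assumption: adjust for your cluster


def count_fields(mapping_node):
    """Recursively count field definitions, including multi-fields."""
    total = 0
    if isinstance(mapping_node, dict):
        for key, value in mapping_node.items():
            if key in ("properties", "fields") and isinstance(value, dict):
                total += len(value)
            total += count_fields(value)
    return total


resp = requests.get(f"{BASE}/_mapping")
resp.raise_for_status()

counts = {
    index: count_fields(body.get("mappings", {}))
    for index, body in resp.json().items()
}
for index, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(f"{index}: {n} fields")
```

If one of the daily indices shows tens of thousands of fields while the rest sit at a few hundred, that's your field explosion, and every new field forces a mapping update that the master has to process and broadcast in the cluster state.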