We're attempting to use a dedicated Marvel cluster (2 c3.2xlarge instances) to monitor our ES cluster. We've recently expanded that cluster and populated more data, bringing it to over 800 indices and over 6,000 shards, and this will keep growing as we populate the database further. Since the expansion, our Marvel indices have grown from 3-5GB a day to over 50GB per day, even though the cluster has only roughly doubled in node count. We now run out of disk space with less than 1 day of Marvel data, whereas previously we used Curator to delete Marvel data older than 7 days.
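For context, the retention we previously ran looked roughly like this. It's only a sketch, written in Curator's newer action-file format; the Curator version we actually used took equivalent command-line flags, so treat the exact filter names as illustrative rather than our literal config.

```yaml
# Sketch only: delete .marvel-* indices older than 7 days
# (newer Curator action-file syntax; our Curator used equivalent CLI flags)
actions:
  1:
    action: delete_indices
    description: Drop Marvel indices once they are a week old
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: .marvel-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 7
```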
We've tried changing two settings so far (sketched in the config snippet below):
marvel.agent.interval - defaults to 10s; we changed it to 30s to reduce CPU usage on the cluster. This doesn't appear to have changed the amount of data (we still run out of disk in under a day), but nodes now frequently show up as not having contacted Marvel recently (the ! next to the node name).
marvel.agent.indices - defaults to "*"; we changed it to "" hoping to turn off all index reporting and see what that does. There was no apparent change in data consumption. Due to the way indices are allocated in the cluster, we cannot pick a single index that exists on all nodes, and we would still like node metrics for every node.
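For concreteness, the two attempts above amounted to something like this in elasticsearch.yml on the monitored nodes. This is a rough sketch: the exporter line is included only for context, and monitoring-host is a placeholder, not our actual endpoint.

```yaml
# elasticsearch.yml on the production (monitored) nodes -- sketch of what we tried
marvel.agent.exporter.es.hosts: ["monitoring-host:9200"]  # placeholder for the dedicated Marvel cluster
marvel.agent.interval: 30s   # default is 10s; raised to cut the agent's CPU overhead
marvel.agent.indices: ""     # default is "*"; the empty string did not noticeably reduce data volume
```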
How can we reduce Marvel's data consumption? Ideally I'd like to keep it down to around 10-15GB per day so we don't have to expand the monitoring cluster further; there's not a lot of utility in holding 50GB per day.
Our ES cluster is 1.4.3, same as the Marvel cluster.
I didn't record that at the time because we weren't having issues. I know the approximate data size per day because I had to configure Curator to clean up old Marvel indices. My best guess is roughly 100 indices and 700 shards before the expansion, though it may have been 2-3x higher because I don't know exactly when certain indices were added. So, as a low-end assumption: we 2x'd the nodes, 8x'd the indices and shards, and more than 10x'd the Marvel usage per day.
The "marvel.agent.indices" documentation hints that increasing indices will increase the data in Marvel, otherwise there wouldn't be a need to limit how much is exported. This is why I tried changing it as mentioned in the original post, though I don't think empty string worked right. Maybe it did and we're still getting this much data from the 26 nodes?
To reiterate, what I'm looking for is a way to decrease how much data we record while continuing to get high-level data for all of the nodes (individual index stats can suffer, as long as I know what the limitations are), or at least a method of projecting how much data will be used.
I covered that point in my initial post. After changing marvel.agent.interval from the default to 30s, I noticed a significant increase in the number of notifications that the Marvel agent had not checked in for a node (the exclamation point next to the node name). These would come and go as you refreshed the page, which made it very difficult to actively debug the cluster for OOM'd nodes. I can only imagine that setting the interval to an hour would cause every node to show as not checked in.
What we ended up changing in our version of Marvel was setting marvel.agent.indices: -*, which excludes all indices from being uploaded (leveraging the minus operator to exclude and the wildcard to match everything). This significantly cut down on the data being uploaded, though we lost the metrics related to indexing and search rates. Do not set marvel.agent.indices to an index that does not exist; it will fill your logs with errors.
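For anyone searching later, the final setting looked like this; we quoted the value so YAML doesn't misread the leading characters, and node-level stats continued to flow as normal.

```yaml
# elasticsearch.yml -- final Marvel agent setting that brought data volume down
marvel.agent.indices: "-*"   # minus + wildcard: exclude index-level stats for every index
```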
We'll probably re-evaluate Marvel and its data consumption after migrating to the converged version releases, once those are stable. For now we're using an external vendor for monitoring. I haven't evaluated any of the incremental Marvel releases since I posted my question, so I don't know whether anything has changed in its behavior.