High Impact of NodeStatsCollector.doCollect()

I have a test cluster (6 nodes) with a fairly high number of indexes, on the order of 600, and a high number of shards, up to 12 per index. I am doing some performance/limits testing and noticed that NodeStatsCollector.doCollect() from Marvel is quite heavy; it seems to compute the file sizes on disk for all indexes/shards. In a 120-second timeframe it used up 10 seconds of CPU time with only one instance of Marvel open. The refresh rate in Marvel was at the default of 10 seconds.

Is this expected? Is there any way to reduce this other than lowering the refresh rate of Marvel?

Hi Dominik,

If I'm doing my math right (600 indexes × 12 shards = 7200 shards ÷ 6 nodes), that looks like about 1200 shards per node, which is really more than you want to have. Remember that each shard carries a small amount of overhead, but small × 1200 starts to add up!

I suspect you'll run into much more serious issues than Marvel using CPU time to calculate the size of files on disk if you decide to run this way in production. That said, you can always turn down the Marvel monitoring frequency by setting marvel.agent.interval in the elasticsearch.yml on your nodes. By default, Marvel pulls monitoring information from every node in the cluster every 10 seconds, and that is what causes the overhead. The Marvel UI simply queries this data and should put nearly no load on your system.
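For reference, lowering the collection frequency looks like this (the 30s value here is just an example, not a recommendation):

```yaml
# elasticsearch.yml on each node:
# collect Marvel monitoring stats every 30 seconds instead of the 10s default
marvel.agent.interval: 30s
```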


Thanks for the reply. Yes, I know this is an artificial test with much too high parameters, but I would like to find out where the limits of our setup are.

In general I find it quite hard to extract any actual reference numbers that work for people from all the documentation and blogs; I'm not sure why everybody seems to be a bit cagey about this.

Also, I thought this might be an easy point to improve a bit: from the stack trace it seems the Marvel agent is calling the no-args constructor, which always collects all stats. If Marvel only needs some of them, the constructor where you specify the required stats via flags would be better suited and could save some of the overhead.
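To make the suggestion concrete, here is a minimal sketch of the pattern being described. These are hypothetical stand-in classes, not the actual Elasticsearch/Marvel API: a no-args constructor enables every stat (the expensive default), while a selective constructor enables only the flags the caller passes in, so costly ones like on-disk store size can be skipped.

```java
import java.util.EnumSet;

public class StatsFlagsSketch {
    // Hypothetical subset of stat categories for illustration only.
    enum Flag { STORE, DOCS, INDEXING, SEARCH, MERGE }

    static class StatsFlags {
        final EnumSet<Flag> enabled;

        // No-args constructor: every stat is collected (the expensive default).
        StatsFlags() {
            this.enabled = EnumSet.allOf(Flag.class);
        }

        // Selective constructor: only the requested stats are collected.
        StatsFlags(Flag... flags) {
            this.enabled = EnumSet.noneOf(Flag.class);
            for (Flag f : flags) {
                this.enabled.add(f);
            }
        }

        boolean isSet(Flag f) {
            return enabled.contains(f);
        }
    }

    public static void main(String[] args) {
        StatsFlags all = new StatsFlags();
        StatsFlags docsOnly = new StatsFlags(Flag.DOCS);
        // With the selective constructor, the expensive on-disk
        // store-size computation (STORE) would be skipped entirely.
        System.out.println(all.isSet(Flag.STORE));      // true
        System.out.println(docsOnly.isSet(Flag.STORE)); // false
        System.out.println(docsOnly.isSet(Flag.DOCS));  // true
    }
}
```

A collector that only needs document counts could then construct its flags with just Flag.DOCS and avoid walking the filesystem for store sizes on every collection interval.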

You're right, some improvements can be made here. Thanks!