We have an indexing process that pulls data from 4 servers, each running about 10-20 processes. Each process does a bulk upload after it has processed enough data. Because there are so many processes, I would expect the insertions to be relatively smooth, but in Marvel I see frequent periodic spikes in indexing, followed by zero activity, repeating.
Is this due to how my index is configured, or is it some issue in Marvel?
If the data were being uploaded at a constant, consistent rate, the line graph would show much less variation.
The marvel.agent.interval setting controls how frequently data samples are collected. If it is still at the default of 10 seconds, and there are 10-second windows in which no data is indexed, that would explain the zeroes on your Y-axis.
I can only think that your bulk uploads are in sync and are collectively uploading data about every 37-38 seconds (5 minutes / 8).
If you want to smooth out the actual indexing rate, you could chunk up the data before the processes upload it, breaking it into multiple smaller bulk inserts.
If you just want the Marvel chart to look smoother, you could try bumping the marvel.agent.interval setting up to 40 seconds (as sketched below).
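A minimal sketch of that change, assuming your Marvel version allows updating this setting dynamically through the cluster settings API (if not, set it in elasticsearch.yml and restart the nodes):

```
PUT /_cluster/settings
{
  "transient": {
    "marvel.agent.interval": "40s"
  }
}
```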
There are up to 40-60 long-running scripts on 4 machines - each one takes an item from a query, does some queries of its own, processes it, and then does a bulk upload once it reaches an internal queue limit (~1000 documents). It's inconceivable to me that these all "magically" sync up consistently.
I took a closer look, and other internal metrics such as the latencies also drop to 0 at the same time.
Is there something else that could have an effect? On other indices I've changed the refresh interval to 30s (I don't think I've changed it on the .marvel-es indices, but maybe it's something like that?). If you zoom in to the 5 minute resolution so you see sub-minute detail, are your graphs smooth?
Any other ideas? There is definitely something not right (it seems!)
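For reference, the refresh interval change I mentioned above was the standard index settings update, roughly like this (index name is a placeholder):

```
PUT /some_index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
```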
If you zoom in to the 5 minute resolution so you see sub-minute detail, are your graphs smooth?
If you zoom in to a fine-grained resolution, the graphs are not expected to be smooth, because the metrics are queried using a derivative aggregation in which each data point represents a bucket of time. If there are no documents within a time bucket, the line chart drops to 0 for that bucket. Reference: Derivative aggregation | Elasticsearch Guide [8.11] | Elastic
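As an illustration (not the exact query Marvel issues), a derivative over a date_histogram looks roughly like this - index and field names are placeholders, and newer Elasticsearch versions spell the interval parameter fixed_interval:

```
GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "per_bucket": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "10s"
      },
      "aggs": {
        "rate": {
          "derivative": {
            "buckets_path": "_count"
          }
        }
      }
    }
  }
}
```

Any 10s bucket with no documents has a _count of 0, which is what drags the chart down to zero.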
I set up a test where I index 1 document roughly every 15 seconds. On the Indices overview, when my time range is 30 minutes, my charts look like what I would expect:
When I make the time range 15 minutes, the graphs show zeroes because some of the time buckets shown have 0 doc count - there was no data available to calculate the derivative:
Just to be clear - I'm indexing all the time - I'm currently running a reindex with 80 scripts on 4 servers sending bulk requests. There is no way this graph represents reality.
Then I would consider this a bug in Marvel. Marvel should either:
1. make sure to collect data every second so there is no missing data (normally there will always be exceptions, but in this case it's the norm - every minute it misses data);
2. not return data for these points, and have the graphing package skip the nulls; or
3. apply a moving average to smooth over the missing data so the derivative doesn't break (see the sketch after this list).
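A hedged sketch of what option 3 could look like, chaining a moving average onto the derivative - names are placeholders, and the moving_avg aggregation of that era has since been replaced by moving_fn in newer Elasticsearch releases:

```
GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "per_bucket": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "10s"
      },
      "aggs": {
        "rate": {
          "derivative": {
            "buckets_path": "_count"
          }
        },
        "smoothed_rate": {
          "moving_avg": {
            "buckets_path": "rate",
            "window": 5
          }
        }
      }
    }
  }
}
```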
Options 2 and 3 are still problematic because they basically mean the data is broken - if I have 4000 rq/s for 4 out of 6 data points, then when I zoom out I'll see the line at the average, which is around 2600 (am I mistaken here?). Even when zooming out, it shows jumpiness which is not real.
The real problem is option 1, which I assume is caused by a bug or misconfiguration in marvel-agent that keeps it from collecting the data properly. In the chart above I'm getting only 1 data point per minute.
I said "every second", but what I really meant was "every time it's supposed to". This is set to collect every 10 seconds, which is fine, but the fact that it's missing data is the problem.
Is the gap_policy for these aggregations set to skip? If not, perhaps that would avoid part of the problem.
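For illustration, gap_policy is a parameter on the pipeline aggregation itself; a sketch with placeholder names:

```
GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "per_bucket": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "10s"
      },
      "aggs": {
        "rate": {
          "derivative": {
            "buckets_path": "_count",
            "gap_policy": "skip"
          }
        }
      }
    }
  }
}
```

With skip, buckets whose input value is missing are ignored instead of breaking the pipeline; note that an empty histogram bucket still has a _count of 0, which is a real value rather than a gap, so this may only avoid part of the problem.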
Perhaps this is part of the problem - maybe 10 seconds is too fast, especially when the server is under load and it's dropping data? I'll try a higher value.
That's what I suspect here. Do you still have this issue?
If so, it would be nice to have the output of /_nodes/hot_threads and /_cluster/pending_tasks during a "peak" (you may need to run them multiple times).
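These are plain GET requests, for example:

```
# capture these a few times while a spike is happening
GET /_nodes/hot_threads
GET /_cluster/pending_tasks
```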
I've pretty much given up on Marvel. We still use it, but it's really not as great a tool as it was. While I understand the move to Kibana 4 for the Elastic ecosystem, it's a huge step backwards in practical usefulness. I haven't had time to install something else, but I'll probably just use an es-graphite/influx stats tool with a Grafana dashboard.
I appreciate your frustration - we are continuing to improve Marvel with each release. I didn't see in this thread which version you are running, but we have dramatically improved things since the 2.0 release, and the upcoming 2.3 release makes a number of other big improvements (e.g. giving you flexibility in how you define node uniqueness, and big improvements to the underlying data model that we capture, opening the door to building new features more rapidly).
We're totally committed to making Marvel the best way to monitor the entire Elastic Stack, and we appreciate your patience (and your feedback!) as we continue to improve.
Please don't hesitate to open other threads if you run into other issues.