Possible bug with monitoring creating thousands of elasticsearch.index.recovery duplicate docs

Since upgrading to 8.x and enabling Monitoring > Logs and metrics > Ship to a deployment (via cloud.elastic.co), I was surprised at the volume of data arriving in the .monitoring-es-8-mb data stream.

On investigation, it seems a lot of duplicate data is being created about index recoveries that happened in the past (see below).

I started by looking at the most common kinds of documents arriving in that data stream, broken down by event.dataset: roughly 60% are elasticsearch.index.recovery, roughly 35% are elasticsearch.index, and the remainder is a small amount of other datasets.
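
For anyone wanting to reproduce this, a quick terms aggregation over the data stream is enough to get that breakdown. This is just a sketch of what I ran from Kibana Dev Tools; the data stream and field names are the ones mentioned above:

```
GET .monitoring-es-8-mb/_search
{
  "size": 0,
  "aggs": {
    "by_dataset": {
      "terms": { "field": "event.dataset", "size": 20 }
    }
  }
}
```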

Digging into the event.dataset=elasticsearch.index.recovery documents, more than 98% of them have elasticsearch.index.recovery.type=PEER and essentially all of them have elasticsearch.index.recovery.stage=DONE, which started to seem odd: why so many docs about recoveries that finished in the past, and none about recoveries in progress?
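
The per-type / per-stage numbers came from something along these lines (again a Dev Tools sketch, filtering to the recovery dataset and aggregating on the two fields above):

```
GET .monitoring-es-8-mb/_search
{
  "size": 0,
  "query": {
    "term": { "event.dataset": "elasticsearch.index.recovery" }
  },
  "aggs": {
    "by_type":  { "terms": { "field": "elasticsearch.index.recovery.type" } },
    "by_stage": { "terms": { "field": "elasticsearch.index.recovery.stage" } }
  }
}
```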

Filtering on one specific value of elasticsearch.index.recovery.name (for example .ds-my.index.name-YYYY.MM.DD-00000N in my case), I found that two records were being created every 10 seconds for it, with apparently no new data in them.

In my case there were only two unique values each for elasticsearch.index.recovery.start_time.ms and elasticsearch.index.recovery.stop_time.ms, so it looks like the same pair of events is being duplicated every 10 seconds.
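
Putting that into a single query, something like the following shows both the steady rate of new documents and the fact that there are only two distinct start/stop timestamps. This is a sketch only, and the recovery.name value is just my placeholder example from above:

```
GET .monitoring-es-8-mb/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.dataset": "elasticsearch.index.recovery" } },
        { "term": { "elasticsearch.index.recovery.name": ".ds-my.index.name-YYYY.MM.DD-00000N" } }
      ]
    }
  },
  "aggs": {
    "docs_per_minute": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" }
    },
    "unique_start_times": {
      "cardinality": { "field": "elasticsearch.index.recovery.start_time.ms" }
    },
    "unique_stop_times": {
      "cardinality": { "field": "elasticsearch.index.recovery.stop_time.ms" }
    }
  }
}
```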

Why this seeming waste of effort and storage space (which, of course, we're paying for)?

Is this expected, a known issue, or something that needs a ticket? I couldn't find anything in my searches so far.

Thanks in advance!

P.S. Looking separately at the event.dataset=elasticsearch.index documents (as opposed to the ....recovery ones), that dataset seems to produce one record per index every 10 seconds. Although this too seems excessive for ancient indices in the cold tier, at least I can see some use in recording potentially-changing information (e.g. elasticsearch.index.total.search.query_time_in_millis might change over time even for an older index), whereas the recovery documents mentioned above appear to be 100% identical records of a past event, so I'm not sure those need re-recording every 10 seconds.

I came across Elasticsearch index_recovery metricset | Metricbeat Reference [8.11] | Elastic, which made me wonder whether the problem is metricbeat (presumably the underlying tool in use here) being configured wrongly:

By default only data about indices which are under active recovery are fetched. To gather data about all indices set index_recovery.active_only: false

So: is the way elastic.co configures metricbeat for metrics shipping perhaps setting index_recovery.active_only: false unnecessarily?
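
For comparison, this is roughly what that setting looks like in a standalone metricbeat elasticsearch module config, based on the reference page above. It is a sketch only - I have no visibility into what cloud.elastic.co actually deploys, so the hosts and period values here are purely illustrative:

```yaml
# modules.d/elasticsearch.yml (standalone metricbeat; illustrative values only)
- module: elasticsearch
  metricsets:
    - index_recovery
  period: 10s
  hosts: ["http://localhost:9200"]
  # Default is true = only report recoveries currently in progress.
  # Setting it to false re-reports every completed recovery on each period,
  # which would explain the duplicate DONE documents described above.
  index_recovery.active_only: false
```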

I'm wondering which GitHub repo to report this as an issue on, having not received a reply here - could someone point me in the right direction?

I am assuming it's not a bug in metricbeat itself, just in the way that cloud.elastic.co hosted instances are configuring it for metric shipping.

Sorry for the delay in replying here; I am chasing this up internally.

OK, this is known and there's work on it here: Index recovery API time-based filtering · Issue #93463 · elastic/elasticsearch · GitHub

Ah, interesting - so from what's written there, it sounds like the cloud.elastic.co config is deliberately requesting that completed recoveries be included (not just active ones).

It still feels excessive to fetch and record this data in the monitoring indices every 10 seconds, given that it is not going to change.
