Monitoring Cluster stopped receiving data from Production Cluster

TonyLuc · October 15, 2021, 10:16pm

OS: Windows Server 2019
Observations are based on:

Metricbeat Version 7.15.0 (amd64), libbeat 7.15.0 [9023152025ec6251bc6b6c38009b309157f10f17 built 2021-09-16 03:28:25 +0000 UTC]
When the Metricbeat agent encounters an error on the Elastic node, it reports the error to the Monitoring Cluster.
The message is stored in the index “metricbeat-7.15.0”. (Note: Based on the mapping, this index does not use the default Metricbeat template.)
When the agent creates a new index, it encounters the following errors (as captured in the log file):

ERROR [publisher_pipeline_output] pipeline/output.go:154 Failed to connect to backoff(elasticsearch(http:// xx.xx.xx.xx:9200)): Connection marked as failed because the onConnect callback failed: resource 'metricbeat-7.15.0' exists, but it is not an alias
INFO [publisher_pipeline_output] pipeline/output.go:145 Attempting to reconnect to backoff(elasticsearch(http:// xx.xx.xx.xx:9200)) with X reconnect attempt(s)
The Monitoring Cluster then stops receiving metrics from the agent.

Verification: Metrics are stored in the indices “.monitoring-es-7-mb-%{+yyyy.MM.dd}”. No new indices are created.
Workaround: Delete the index “metricbeat-7.15.0”, restart the Metricbeat agent, and the Monitoring Cluster resumes collecting metrics.

Verification: New “.monitoring-es-7-mb-%{+yyyy.MM.dd}” indices are created.
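
Before deleting anything, it can help to confirm that “metricbeat-7.15.0” really is a concrete index and not an alias. The following is a minimal Python sketch, not a tested tool. It relies on the fact that `GET /<name>` returns a body keyed by concrete index name, so an alias appears under its backing indices instead of under its own name. The helper names `classify_resource` and `fetch_index_body` are made up for illustration, and the sketch assumes an unsecured HTTP endpoint like the one in the log above.

```python
import json
import urllib.error
import urllib.request


def classify_resource(name, get_index_body):
    """Classify `name` from the parsed body of `GET /<name>`.

    The response is keyed by concrete index name, so an alias shows up
    under its backing indices rather than under its own name.
    """
    if not get_index_body:
        return "missing"
    if name in get_index_body:
        return "index"
    return "alias"


def fetch_index_body(base_url, name, timeout=10):
    """Return the parsed JSON of GET <base_url>/<name>, or {} on HTTP 404."""
    try:
        with urllib.request.urlopen(f"{base_url}/{name}", timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return {}  # nothing with that name exists
        raise
```

Calling `classify_resource(name, fetch_index_body(url, name))` with the monitoring cluster's address as `url` and `name = "metricbeat-7.15.0"` would be expected to return "index" in the broken state described here, and "alias" once Metricbeat has set up its write alias again.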

When Metricbeat sets up the default index template, the default index lifecycle write alias name is “metricbeat-%{[agent.version]}” = “metricbeat-7.15.0”.
This alias clashes with the index created when Metricbeat encounters an error, which stops the agent from writing to the Monitoring Cluster.
It is likely that when the agent was writing the error, the index name wasn’t specified properly (i.e. the date math portion may be missing); it should be metricbeat-%{[agent.version]}-%{+yyyy.MM.dd} instead of metricbeat-%{[agent.version]}.

Is the hypothesis correct in stating that a default configuration somewhere (minor bug) needs to be updated?
As a workaround, is there somewhere I can specify the index name for the agent to use when writing error messages?
Finally, I may be completely wrong; what other causes or explanations might there be?

Thanks for any help or feedback.

stephenb · October 16, 2021, 3:43pm

Take a look at this post. I suspect you may be in the same situation.

TonyLuc · October 18, 2021, 5:03pm

Thanks Stephen, great explanations. You are right that I am in a similar situation.

Right down to the fact that I need to shut down all instances of the Metricbeat agent before deleting the incorrect index.
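
For anyone scripting that delete step, here is a rough Python sketch of building the `DELETE /<index>` request with a small guard against wildcards and comma-separated lists, so that only the one named index can be targeted. The function name `build_delete_index_request` and the address used in the example are placeholders, and it assumes an unsecured HTTP endpoint. It only builds the request and does not send it.

```python
import urllib.parse
import urllib.request


def build_delete_index_request(base_url, index_name):
    """Build (but do not send) a DELETE request for one concrete index."""
    # Refuse wildcards, comma lists and API-style names so that only the
    # single named index can be targeted.
    if (
        not index_name
        or any(ch in index_name for ch in "*,")
        or index_name.startswith(("_", "-"))
    ):
        raise ValueError(f"refusing to delete ambiguous target: {index_name!r}")
    url = f"{base_url.rstrip('/')}/{urllib.parse.quote(index_name, safe='')}"
    return urllib.request.Request(url, method="DELETE")
```

Once every Metricbeat instance is stopped, the request returned by `build_delete_index_request("http://localhost:9200", "metricbeat-7.15.0")` could be sent with `urllib.request.urlopen(...)`. If any agent is still running it can recreate the index straight away, which is why the shutdown has to come first.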

As usual I have proven to the community how little I know.

Now that I know “Why”, it’s time to figure out how it happens! (I have a feeling someone is going to owe me lunch.)

Thanks again!

