Monitoring Cluster stopped receiving data from Production Cluster

Observations

  • OS: Windows Server 2019

  • Observations are based on:

    Metricbeat Version 7.15.0 (amd64), libbeat 7.15.0 [9023152025ec6251bc6b6c38009b309157f10f17 built 2021-09-16 03:28:25 +0000 UTC]

  • When the Metricbeat agent encounters an error on the Elastic node, it reports the error to the Monitoring Cluster.

  • The message is stored in the index “metricbeat-7.15.0”. (Note: Based on its mapping, this index is not using the default Metricbeat template.)

  • When the agent creates a new index, it encounters the following errors (as captured in the log file):

    ERROR [publisher_pipeline_output] pipeline/output.go:154 Failed to connect to backoff(Elasticsearch(http:// xx.xx.xx.xx:9200)): Connection marked as failed because the onConnect callback failed: resource 'metricbeat-7.15.0' exists, but it is not an alias
    INFO [publisher_pipeline_output] pipeline/output.go:145 Attempting to reconnect to backoff(Elasticsearch(http:// xx.xx.xx.xx:9200)) with X reconnect attempt(s)

  • The Monitoring Cluster stops receiving metrics from the agent.

    Verification: Metrics are stored in indices named “.monitoring-es-7-mb-%{+yyyy.MM.dd}”. No new indices are created.

  • After deleting the index “metricbeat-7.15.0” and restarting the Metricbeat agent, the Monitoring Cluster resumes collecting metrics.

    Verification: New “.monitoring-es-7-mb-%{+yyyy.MM.dd}” indices are created.
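
The “exists, but it is not an alias” error suggests that “metricbeat-7.15.0” is a concrete index rather than an alias. A minimal sketch of how that could be confirmed, assuming the usual shape of a `GET /<name>` response (a JSON object keyed by concrete index name, each entry carrying an "aliases" object). The function names and the sample bodies are illustrative, not captured output, and `fetch_index_info` assumes no authentication or TLS.

```python
import json
import urllib.request


def classify_resource(name, get_index_response):
    """Classify `name` from the parsed JSON body of `GET /<name>`.

    The response is keyed by concrete index name, so a key equal to `name`
    means `name` is itself an index; otherwise `name` should be listed under
    the "aliases" of the backing index or indices.
    """
    if name in get_index_response:
        return "index"
    for body in get_index_response.values():
        if name in body.get("aliases", {}):
            return "alias"
    return "unknown"


def fetch_index_info(base_url, name):
    """Hypothetical helper: fetch `GET <base_url>/<name>` without auth or TLS."""
    with urllib.request.urlopen(f"{base_url}/{name}") as resp:
        return json.load(resp)


# Sample bodies shaped like the two cases (illustrative only).
broken = {"metricbeat-7.15.0": {"aliases": {}}}
healthy = {
    "metricbeat-7.15.0-2021.09.16-000001": {
        "aliases": {"metricbeat-7.15.0": {"is_write_index": True}}
    }
}

print(classify_resource("metricbeat-7.15.0", broken))   # index
print(classify_resource("metricbeat-7.15.0", healthy))  # alias
```

Against a live cluster, `classify_resource(name, fetch_index_info(base_url, name))` would do the same check; a result of "index" matches the state described by the error message.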

Hypothesis

  • When Metricbeat sets up the default index template, the default index lifecycle write alias name is “metricbeat-%{[agent.version]}” = “metricbeat-7.15.0”.

  • This alias clashes with the index created when Metricbeat encounters an error, which stops the agent from writing to the Monitoring Cluster.

  • It is likely that when the agent wrote the error, the index name wasn’t specified properly (i.e. the date portion may be missing): it should be metricbeat-%{[agent.version]}-%{+yyyy.MM.dd} instead of metricbeat-%{[agent.version]}.
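
To make the name clash concrete, here is a small sketch that renders the three names involved. It assumes the Metricbeat 7.x defaults as I understand them: `setup.ilm.rollover_alias` is `metricbeat-%{[agent.version]}`, `setup.ilm.pattern` is `{now/d}-000001`, and the index name without ILM is `metricbeat-%{[agent.version]}-%{+yyyy.MM.dd}`. The helper functions are hypothetical and only mimic the formatting; they are not part of Metricbeat.

```python
from datetime import date


def ilm_write_alias(agent_version):
    # Assumed default for setup.ilm.rollover_alias: metricbeat-%{[agent.version]}
    return f"metricbeat-{agent_version}"


def ilm_first_backing_index(agent_version, day):
    # Assumed default for setup.ilm.pattern: {now/d}-000001
    return f"{ilm_write_alias(agent_version)}-{day:%Y.%m.%d}-000001"


def daily_index(agent_version, day):
    # Assumed default index name when ILM is not used:
    # metricbeat-%{[agent.version]}-%{+yyyy.MM.dd}
    return f"metricbeat-{agent_version}-{day:%Y.%m.%d}"


day = date(2021, 9, 16)
print(ilm_write_alias("7.15.0"))               # metricbeat-7.15.0
print(ilm_first_backing_index("7.15.0", day))  # metricbeat-7.15.0-2021.09.16-000001
print(daily_index("7.15.0", day))              # metricbeat-7.15.0-2021.09.16
```

Only the write alias has no date suffix. If anything writes to “metricbeat-7.15.0” while no alias of that name exists, Elasticsearch can auto-create a concrete index with that exact name, which would produce the clash described above.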

Questions

  • Is the hypothesis correct in stating that a default configuration somewhere (minor bug) needs to be updated?

  • As a workaround, is there somewhere I can specify the index name for the agent to use when writing error messages?

  • Finally, I may be completely wrong. What other causes or explanations might there be?

Thanks for any help or feedback.

Take a look at this post. I suspect you may be in the same situation.

Thanks Stephen, great explanation. You are right that I am in a similar situation.

Right down to the fact that I need to shut down all instances of the Metricbeat agent before deleting the incorrect index.
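
A minimal sketch of the delete step, for anyone in the same spot. It only builds the `DELETE /<index>` request and does not send it; the host is a placeholder and the helper name is hypothetical. Presumably the agents have to be stopped first because any instance still running can write to the same name and recreate the index straight away.

```python
import urllib.request


def build_delete_index_request(base_url, index_name):
    """Build (but do not send) the DELETE request for one concrete index."""
    # Guard against deleting more than the single offending index.
    if "*" in index_name or "," in index_name:
        raise ValueError("refusing wildcard or multi-index delete")
    return urllib.request.Request(f"{base_url}/{index_name}", method="DELETE")


# Placeholder host. Send only after every Metricbeat instance is stopped,
# e.g. with urllib.request.urlopen(req).
req = build_delete_index_request("http://localhost:9200", "metricbeat-7.15.0")
print(req.get_method(), req.full_url)
```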

As usual I have proven to the community how little I know.

Now that I know “why”, it’s time to figure out how it happens! (I have a feeling someone is going to owe me lunch.)

Thanks again!
