Stack monitoring with Metricbeat producing too much data? Getting timeouts

Hello. I have a problem where the Stack Monitoring page times out when I try to look at the cluster overview over a period of 24 hours (or "Today"). The cluster overview for "Last 1 hour" works fine, but longer periods time out. Longer periods do work if I delete the underlying .monitoring-es* indices and collection starts again from zero... until the indices grow for some hours and it times out again.

So the cluster is on version 7.15.0 and has 29 nodes. A separate Metricbeat VM (7.15.1, elasticsearch-xpack module enabled, scope: cluster, metricsets disabled) queries the cluster for stats and forwards them to a separate three-node monitoring cluster. The monitoring cluster stores the data in .monitoring-es-7* indices.
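For context, the Metricbeat side is just the stock elasticsearch-xpack module. A minimal sketch of what I have (the hosts and period values here are illustrative placeholders, not my exact settings):

    # modules.d/elasticsearch-xpack.yml -- minimal sketch, host/period values are placeholders
    - module: elasticsearch
      xpack.enabled: true
      scope: cluster
      period: 10s
      hosts: ["https://monitored-cluster-node:9200"]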

Looking at the .monitoring-es-7* index pattern, I can see that it has 253 correctly pre-mapped fields. But when I go to Discover, the fields panel shows 956 available fields, most of which are "unknown fields". Some examples of unknown fields: "cluster_state.nodes.XYZ...XYZ.attributes.xpack.installed", "cluster_name", "cluster_stats.indices.query_cache.evictions", and so on. It seems like every added node creates at least 7 more unique unknown fields (with the node ID as part of the field name) in the index. If I add a few more nodes to the cluster, I will hit the limit of 1000 fields allowed per index.
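In case it's useful, this is roughly how I checked the mapped fields and the field limit (a sketch; I'm assuming the .monitoring-es-7* pattern matches the monitoring indices):

    # List the fields the index exposes via the field caps API
    GET .monitoring-es-7*/_field_caps?fields=*

    # Check the per-index mapped field limit (defaults to 1000)
    GET .monitoring-es-7*/_settings/index.mapping.total_fields.limit?include_defaults=true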

Today the .monitoring-es-7* index has grown to 2.2 GB / 617,347 documents in about 7 hours, and I'm already getting timeouts in Stack Monitoring.
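I'm tracking the growth with something like this (again assuming the .monitoring-es-7* pattern):

    # Size and document count of the monitoring indices
    GET _cat/indices/.monitoring-es-7*?v&h=index,docs.count,store.size&s=index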

Why is this happening? This can't possibly be expected behavior? I suspect that the number of fields/docs and the amount of data may be causing the timeouts on the Stack Monitoring overview.

I should mention that everything was working fine when the cluster and Metricbeat were on 7.13.1... until I upgraded both to 7.15.0.

Hi @heikis ,

The fields under cluster_state.nodes shouldn't contribute to the mapped field count, since that object doesn't contain any sub-field mappings.

You should be able to confirm this by checking GET .monitoring-es-*/_mapping and seeing something like this:

        "cluster_stats" : {
          "properties" : {
            "indices" : {
              "type" : "object"
            },
            "nodes" : {
              "type" : "object"
            }
          }
        },

I have a three-node 7.15.1 cluster here where I'm able to pull 7 days of data, but the scale is quite a bit smaller.

Do you have Elastic APM set up already? It might be worth configuring Kibana to point at the APM server and seeing what comes up in the set of slow request transactions.

The config to put in kibana.yml would look like:

    elastic.apm.active: true
    elastic.apm.serverUrl: (APM SERVER ENDPOINT)
    elastic.apm.secretToken: (APM TOKEN)
    elastic.apm.centralConfig: false
    elastic.apm.breakdownMetrics: false
    elastic.apm.transactionSampleRate: 0.1
    elastic.apm.metricsInterval: 120s
    elastic.apm.captureSpanStackTraces: false

I don't think I have anything handy that'd let me compare page performance between 7.13 and 7.15 for a large number of nodes, but I'll see if I can dig up anything.

In the meantime, if you can use APM to spot which ES query is slow and feed that into Profile queries and aggregations | Kibana Guide [7.15] | Elastic, it might help isolate the issue.
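For example, once APM points at a slow query, you could re-run it with profiling enabled and paste the result (or the query itself) into the Search Profiler in Dev Tools. A sketch with a placeholder query body, since the real one would be whatever APM shows for the slow transaction:

    GET .monitoring-es-7*/_search
    {
      "profile": true,
      "size": 0,
      "query": {
        "range": {
          "timestamp": { "gte": "now-24h" }
        }
      }
    }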
