Kibana monitoring gaps when data node is replaced

Hi,

I am using ES 7.2.0 on AWS spot instances.
The cluster consists of 5 master nodes, 8 data nodes, and 2 coordinating nodes.

I noticed that whenever a data node is replaced (due to spot replacement), there are monitoring gaps for the whole cluster.

During the gaps I ran GET .monitoring-es-7-*/_search and it returned valid data.
The gap usually starts when the node comes up and lasts until the EBS volume is warmed up and its read latency drops (usually ~20-30 minutes).

Note that the cluster is functioning properly during these gap periods; it is in yellow state, as a few shards are unassigned for that time.
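For anyone triaging similar gaps: the yellow status and unassigned-shard count come straight from the _cluster/health API. A minimal Python sketch that condenses such a response (the sample payload below is illustrative, not from this cluster):

```python
import json

def summarize_health(health: dict) -> str:
    """Condense a _cluster/health response into a one-line triage summary."""
    return (f"status={health['status']} "
            f"unassigned_shards={health['unassigned_shards']} "
            f"initializing_shards={health['initializing_shards']}")

# Illustrative payload shaped like GET _cluster/health (not from this cluster):
sample = json.loads(
    '{"status": "yellow", "unassigned_shards": 4, "initializing_shards": 1}'
)
print(summarize_health(sample))
# -> status=yellow unassigned_shards=4 initializing_shards=1
```

Watching this summary during a spot replacement makes it easy to correlate the monitoring gap with shard recovery.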

My question is: why is monitoring for the whole cluster affected by this one node, which may not be responding to the monitoring queries?

Thanks

Hi @Barak,

A couple of things we should double check:

  1. Are your master nodes exclusively master nodes?
  2. During this period of time, are there any logs indicating discovery (or other) issues on the master nodes?

Sorry for the late response @chrisronline.

  1. Yes, they are master-only.
  2. I found the logs below on the master node.

It looks like the new data node that joined (ip-172-30-0-16.ec2.internal) got disconnected a few times, possibly due to intensive I/O on its disk during the initial warm-up. What I find odd, however, is that this affects monitoring for the whole cluster instead of only for that specific node.

[2019-12-15T12:02:04,886][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] removed {{ip-172-30-2-69.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{jENiFLyySW6bymfiCgLvSg}{172.30.2.69}{172.30.2.69:9300}{aws_availability_zone=us-east-1c, ml.machine_memory=66010615808, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41094, reason: ApplyCommitRequest{term=37, version=41094, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:13:06,893][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] added {{ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41115, reason: ApplyCommitRequest{term=37, version=41115, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:27:16,545][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] removed {{ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41126, reason: ApplyCommitRequest{term=37, version=41126, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:27:16,546][WARN ][o.e.x.m.MonitoringService] [ip-172-30-1-6.ec2.internal] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
        at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:109) [x-pack-monitoring-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$1(LocalBulk.java:112) [x-pack-monitoring-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:50) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:74) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.transport.TransportService$8.run(TransportService.java:973) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.2.0.jar:7.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulk [default_local]
        ... 11 more
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [ip-172-30-0-16.ec2.internal][172.30.0.16:9300][indices:data/write/bulk] disconnected
[2019-12-15T12:27:29,357][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] added {{ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41138, reason: ApplyCommitRequest{term=37, version=41138, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:31:04,574][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] removed {{ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41146, reason: ApplyCommitRequest{term=37, version=41146, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:31:04,575][WARN ][o.e.x.m.MonitoringService] [ip-172-30-1-6.ec2.internal] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
        at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:109) [x-pack-monitoring-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$1(LocalBulk.java:112) [x-pack-monitoring-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:50) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:74) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.transport.TransportService$8.run(TransportService.java:973) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.2.0.jar:7.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulk [default_local]
        ... 11 more
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [ip-172-30-0-16.ec2.internal][172.30.0.16:9300][indices:data/write/bulk] disconnected
[2019-12-15T12:33:20,780][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] added {{ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41161, reason: ApplyCommitRequest{term=37, version=41161, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:36:02,463][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] removed {{ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41162, reason: ApplyCommitRequest{term=37, version=41162, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:36:02,463][WARN ][o.e.x.m.MonitoringService] [ip-172-30-1-6.ec2.internal] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks

In addition, I can see the following on the data node.
The node says "master not discovered yet" and then lists all the master nodes.

After digging a bit I found a similar issue caused by loading global ordinals on large shards; however, as far as I understand, global ordinals are loaded only after a search containing aggregations, and our use case doesn't involve those.

[2019-12-16T11:51:32,570][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-2-197.ec2.internal] collector [node_stats] timed out when collecting data
[2019-12-16T11:51:36,709][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ip-172-30-2-197.ec2.internal] master not discovered yet: have discovered [{ip-172-30-1-6.ec2.internal}{Xd6Ex6OXRdac4pwZCxyTEQ}{yykEZM_BQaiyFoXTLYeF0A}{172.30.1.6}{172.30.1.6:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-2-48.ec2.internal}{FcGYX-cHSNSnszWd2xO0Rg}{c8_lsSoeTGu1FTCGOOlxmQ}{172.30.2.48}{172.30.2.48:9300}{aws_availability_zone=us-east-1c, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-54.ec2.internal}{qN0zCaZhTCGPGu849drP1Q}{3LTpzVViSICI6BbMRoyLow}{172.30.0.54}{172.30.0.54:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-248.ec2.internal}{EODGswyyT0WpM8FMUcIRSQ}{Cs83E-saTwmxLLblIiW_Qw}{172.30.0.248}{172.30.0.248:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 172.30.0.37:9300, 172.30.0.148:9300, 172.30.0.54:9300, 172.30.0.248:9300, 172.30.1.6:9300, 172.30.1.176:9300, 172.30.2.48:9300] from hosts providers and [{ip-172-30-1-6.ec2.internal}{Xd6Ex6OXRdac4pwZCxyTEQ}{yykEZM_BQaiyFoXTLYeF0A}{172.30.1.6}{172.30.1.6:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-15.ec2.internal}{1ogTu-_qQcCMCwLATVjDpg}{bneaD8Q0TEeVX7fAeI77WQ}{172.30.0.15}{172.30.0.15:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66010615808, 
ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-2-198.ec2.internal}{Pulg8N-8R0KfX5yUwiC_dQ}{yNSnzzDtRiqj50sJkyNxyQ}{172.30.2.198}{172.30.2.198:9300}{aws_availability_zone=us-east-1c, ml.machine_memory=133656322048, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-148.ec2.internal}{zJiqfiIZQIGtELtjY7jnxg}{T5-yHwR_ShafZognjIj7-w}{172.30.0.148}{172.30.0.148:9300}{ml.machine_memory=8362668032, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-217.ec2.internal}{MAgVF9bpRC6lAOwNxWNj7A}{bdauGFO9SeGunEVrnmwayw}{172.30.0.217}{172.30.0.217:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=133656322048, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-1-242.ec2.internal}{5SmrUGH4SRuLDYLdrOt77g}{G1v6wDcaR6iio_tr4ajLuw}{172.30.1.242}{172.30.1.242:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=133656322048, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-1-48.ec2.internal}{rcWaUYZTSUaAviidjIESRA}{N1xKewxMTkuta5wXSZivYg}{172.30.1.48}{172.30.1.48:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=66010615808, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-1-94.ec2.internal}{lYPBVwzbQhuprJql2Lamdg}{Ic2TAyPdR1GyCKLugB0S8Q}{172.30.1.94}{172.30.1.94:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=66010615808, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-2-48.ec2.internal}{FcGYX-cHSNSnszWd2xO0Rg}{c8_lsSoeTGu1FTCGOOlxmQ}{172.30.2.48}{172.30.2.48:9300}{aws_availability_zone=us-east-1c, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-37.ec2.internal}{If_ODG8WQnKKWLiU7PmRdw}{jjpe0Q7uQ9ib7G4sSKNDXw}{172.30.0.37}{172.30.0.37:9300}{ml.machine_memory=8362668032, ml.max_open_jobs=20, xpack.installed=true}, 
{ip-172-30-0-54.ec2.internal}{qN0zCaZhTCGPGu849drP1Q}{3LTpzVViSICI6BbMRoyLow}{172.30.0.54}{172.30.0.54:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-2-91.ec2.internal}{hkMoT0WES-C2F-OnUnc-_A}{CfbGhwBoTCaWqjkQjVVDfw}{172.30.2.91}{172.30.2.91:9300}{aws_availability_zone=us-east-1c, ml.machine_memory=66010615808, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-2-197.ec2.internal}{e0bwfaScSHmflJ1MSqXdxA}{dLHLLY6zSfC9Hi9UST9-dA}{172.30.2.197}{172.30.2.197:9300}{aws_availability_zone=us-east-1c, ml.machine_memory=67534430208, xpack.installed=true, ml.max_open_jobs=20}, {ip-172-30-0-248.ec2.internal}{EODGswyyT0WpM8FMUcIRSQ}{Cs83E-saTwmxLLblIiW_Qw}{172.30.0.248}{172.30.0.248:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 37, last-accepted version 42511 in term 37
[2019-12-16T11:51:44,543][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-2-197.ec2.internal] collector [node_stats] timed out when collecting data

Hmm. Okay, let's figure this out.

Let's run the query to fetch the Search Rate graph for one of these black-out periods and see what the data is telling us:

Fill in <cluster_uuid> with the right cluster UUID, then adjust the time range to the affected period.

POST .monitoring-es-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "cluster_uuid": "<cluster_uuid>"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "2019-12-16T00:51:07.080Z",
              "lte": "2019-12-16T18:51:07.080Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "check": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "30s"
      },
      "aggs": {
        "metric": {
          "max": {
            "field": "indices_stats._all.total.search.query_total"
          }
        },
        "metric_deriv": {
          "derivative": {
            "buckets_path": "metric",
            "gap_policy": "skip",
            "unit": "1s"
          }
        }
      }
    }
  }
}

Let's see if this helps us. Thanks!

So it appears the data exists in the monitoring indices; it's just not displayed in Kibana.
If I try to zoom in on the blank timeframe in Kibana I get "Monitoring Request Failed. Unable to find the cluster in the selected time range. UUID: vd63AAWCTc6rZ1RiIRSK4g. HTTP 404"

Here's a screenshot of the timeframe:

And here's the query and response
Query:

{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "cluster_uuid": "vd63AAWCTc6rZ1RiIRSK4g"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "2019-12-17T03:48:00.080Z",
              "lte": "2019-12-17T03:49:00.080Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "check": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "30s"
      },
      "aggs": {
        "metric": {
          "max": {
            "field": "indices_stats._all.total.search.query_total"
          }
        },
        "metric_deriv": {
          "derivative": {
            "buckets_path": "metric",
            "gap_policy": "skip",
            "unit": "1s"
          }
        }
      }
    }
  }
}

Response (truncated):

    {
        "took": 10,
        "timed_out": false,
        "_shards": {
            "total": 7,
            "successful": 7,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 222,
                "relation": "eq"
            },
            "max_score": 0.0,
            "hits": [
                {
                    "_index": ".monitoring-es-7-2019.12.17",
                    "_type": "_doc",
                    "_id": "FcP4EW8Byx4LD79xQ0Os",
                    "_score": 0.0,
                    "_source": {
                        "cluster_uuid": "vd63AAWCTc6rZ1RiIRSK4g",
                        "timestamp": "2019-12-17T03:48:00.543Z",
                        "interval_ms": 10000,
                        "type": "node_stats",
                        "source_node": {
                            "uuid": "5SmrUGH4SRuLDYLdrOt77g",
                            "host": "172.30.1.242",
                            "transport_address": "172.30.1.242:9300",
                            "ip": "172.30.1.242",
                            "name": "ip-172-30-1-242.ec2.internal",
                            "timestamp": "2019-12-17T03:48:00.543Z"
                        },
                        "node_stats": {
                            "node_id": "5SmrUGH4SRuLDYLdrOt77g",
                            "node_master": false,
                            "mlockall": false,
                            "indices": {
                                "docs": {
                                    "count": 26609220
                                },
                                "store": {
                                    "size_in_bytes": 49286464235
                                },
                                "indexing": {
                                    "index_total": 3303155,
                                    "index_time_in_millis": 1646136,
                                    "throttle_time_in_millis": 0
                                },
                                "search": {
                                    "query_total": 142097245,
                                    "query_time_in_millis": 855398284
                                },
                                "query_cache": {
                                    "memory_size_in_bytes": 32790152,
                                    "hit_count": 227876976,
                                    "miss_count": 249589745,
                                    "evictions": 20350234
                                },
                                "fielddata": {
                                    "memory_size_in_bytes": 0,
                                    "evictions": 0
                                },
                                "segments": {
                                    "count": 88,
                                    "memory_in_bytes": 11952968186,
                                    "terms_memory_in_bytes": 11943687306,
                                    "stored_fields_memory_in_bytes": 7562824,
                                    "term_vectors_memory_in_bytes": 0,
                                    "norms_memory_in_bytes": 54656,
                                    "points_memory_in_bytes": 978040,
                                    "doc_values_memory_in_bytes": 685360,
                                    "index_writer_memory_in_bytes": 849632,
                                    "version_map_memory_in_bytes": 748,
                                    "fixed_bit_set_memory_in_bytes": 132592
                                },
                                "request_cache": {
                                    "memory_size_in_bytes": 0,
                                    "evictions": 0,
                                    "hit_count": 30,
                                    "miss_count": 11390
                                }
                            },
                            "os": {
                                "cpu": {
                                    "load_average": {
                                        "1m": 0.2,
                                        "5m": 0.2,
                                        "15m": 1.11
                                    }
                                },
                                "cgroup": {
                                    "cpuacct": {
                                        "control_group": "/",
                                        "usage_nanos": 1407906507480565
                                    },
                                    "cpu": {
                                        "control_group": "/",
                                        "cfs_period_micros": 100000,
                                        "cfs_quota_micros": -1,
                                        "stat": {
                                            "number_of_elapsed_periods": 0,
                                            "number_of_times_throttled": 0,
                                            "time_throttled_nanos": 0
                                        }
                                    },
                                    "memory": {
                                        "control_group": "/",
                                        "limit_in_bytes": "9223372036854771712",
                                        "usage_in_bytes": "98930262016"
                                    }
                                }
                            },
                            "process": {
                                "open_file_descriptors": 898,
                                "max_file_descriptors": 65535,
                                "cpu": {
                                    "percent": 1
                                }
                            },
                            "jvm": {
                                "mem": {
                                    "heap_used_in_bytes": 19752021000,
                                    "heap_used_percent": 57,
                                    "heap_max_in_bytes": 34246361088
                                },
                                "gc": {
                                    "collectors": {
                                        "young": {
                                            "collection_count": 1582085,
                                            "collection_time_in_millis": 36134761
                                        },
                                        "old": {
                                            "collection_count": 33,
                                            "collection_time_in_millis": 6668
                                        }
                                    }
                                }
                            },
                            "thread_pool": {
                                "generic": {
                                    "threads": 20,
                                    "queue": 0,
                                    "rejected": 0
                                },
                                "get": {
                                    "threads": 1,
                                    "queue": 0,
                                    "rejected": 0
                                },
                                "management": {
                                    "threads": 5,
                                    "queue": 0,
                                    "rejected": 0
                                },
                                "search": {
                                    "threads": 25,
                                    "queue": 0,
                                    "rejected": 0
                                },
                                "watcher": {
                                    "threads": 0,
                                    "queue": 0,
                                    "rejected": 0
                                },
                                "write": {
                                    "threads": 16,
                                    "queue": 0,
                                    "rejected": 0
                                }
                            },
                            "fs": {
                                "total": {
                                    "total_in_bytes": 1056750854144,
                                    "free_in_bytes": 995028893696,
                                    "available_in_bytes": 951978315776
                                },
                                "io_stats": {
                                    "total": {
                                        "operations": 11420617,
                                        "read_operations": 444892,
                                        "write_operations": 10975725,
                                        "read_kilobytes": 16952040,
                                        "write_kilobytes": 290891716
                                    }
                                }
                            }
                        }
                    }
                },
                {
                    "_index": ".monitoring-es-7-2019.12.17",
                    "_type": "_doc",
                    "_id": "8x34EW8BQZsPnUc7RfgG",
                    "_score": 0.0,
                    "_source": {
                        "cluster_uuid": "vd63AAWCTc6rZ1RiIRSK4g",
                        "timestamp": "2019-12-17T03:48:00.875Z",


.
.
.
    "aggregations": {
        "check": {
            "buckets": [
                {
                    "key_as_string": "2019-12-17T03:48:00.000Z",
                    "key": 1576554480000,
                    "doc_count": 111,
                    "metric": {
                        "value": 3.96591825E8
                    }
                },
                {
                    "key_as_string": "2019-12-17T03:48:30.000Z",
                    "key": 1576554510000,
                    "doc_count": 111,
                    "metric": {
                        "value": 3.966284E8
                    },
                    "metric_deriv": {
                        "value": 36575.0,
                        "normalized_value": 1219.1666666666667
                    }
                }
            ]
        }
    }
}
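As a sanity check, the metric_deriv value above is just the per-second difference of consecutive query_total maxima. A minimal Python sketch using the two bucket values from the response:

```python
# Two consecutive 30s buckets from the aggregation response above:
# max(query_total) per bucket, a cumulative counter.
prev_total = 396_591_825   # 3.96591825E8 at 03:48:00
curr_total = 396_628_400   # 3.966284E8  at 03:48:30

def search_rate_per_sec(prev: int, curr: int, interval_s: int = 30) -> float:
    """Per-second derivative of the cumulative query_total counter."""
    return (curr - prev) / interval_s

print(search_rate_per_sec(prev_total, curr_total))
# -> 1219.1666666666667, matching metric_deriv.normalized_value
```

So the raw numbers reconstruct exactly what the Search Rate graph should be plotting for those buckets.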

Thanks for that!

The data is there, but I wonder if other data is missing that would cause the UI to error out like that.

Let's try another query:

POST .monitoring-es-*/_search
{
  "size": 1000,
  "sort": {
    "timestamp": {
      "order": "desc"
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "type": "cluster_stats"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "2019-12-17T03:48:00.080Z",
              "lte": "2019-12-17T03:49:00.080Z"
            }
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "cluster_uuid"
  }
}
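For intuition, the collapse clause here keeps a single top-sorted hit per cluster_uuid. A rough Python equivalent of that dedupe (the sample docs are illustrative, not real monitoring documents):

```python
def collapse_by_cluster(hits):
    """Keep only the first hit per cluster_uuid after the desc sort,
    mirroring what `collapse` does in the query above."""
    seen, out = set(), []
    for hit in sorted(hits, key=lambda h: h["timestamp"], reverse=True):
        if hit["cluster_uuid"] not in seen:
            seen.add(hit["cluster_uuid"])
            out.append(hit)
    return out

# Illustrative docs (not real monitoring documents):
docs = [
    {"cluster_uuid": "A", "timestamp": "2019-12-17T03:48:10Z"},
    {"cluster_uuid": "A", "timestamp": "2019-12-17T03:48:40Z"},
    {"cluster_uuid": "B", "timestamp": "2019-12-17T03:48:20Z"},
]
print([(d["cluster_uuid"], d["timestamp"]) for d in collapse_by_cluster(docs)])
# -> [('A', '2019-12-17T03:48:40Z'), ('B', '2019-12-17T03:48:20Z')]
```

In other words, this query answers "what is the newest cluster_stats document per cluster in that window", which is roughly what the UI needs to resolve the cluster.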

This one also returned a valid response:

Thanks for that. I'm still not exactly sure.

Can you capture and upload a HAR file? Make sure you start capturing after selecting the time period that causes the issue. The capture doesn't need to be long - we just need a single set of XHR requests to the Kibana server.

See https://community.box.com/t5/Managing-Content-Troubleshooting/How-to-Generate-Network-Captures-for-Troubleshooting/ta-p/366#toc-hId--671775661

Thanks @chrisronline.
The HAR file is here: https://transfernow.net/ddl/elastic

In the responses you can see many null values; these were generated when I selected a timeframe that contains a gap but also has some data before and after it.
The 404 was generated when selecting only an empty timeframe.

Yeah, this is interesting. There seems to be data in the affected time period (rather than an actual absence of data), but I don't quite know why the UI isn't showing anything.

I don't have specific next steps, but I have an idea of how we can debug this further.

Let's turn on query logging for monitoring and get a list of the queries that are executed while only the affected time window is selected (don't bother expanding the time period to include "valid" data points).

You'll need two settings in kibana.yml to do this:

xpack.monitoring.elasticsearch.logQueries: true
logging.verbose: true

Collect those queries, then execute each one in the Kibana Dev Tools console and post the query and response for each.

We'll figure this out!

I didn't see the POST data in the logs. This is what was generated during the request for the empty timeframe:

Dec 24 15:28:42 ip-172-30-0-37 kibana: {"type":"log","@timestamp":"2019-12-24T15:28:42Z","tags":["debug","legacy-proxy"],"pid":13484,"message":"Event is being forwarded: connection"}
Dec 24 15:28:42 ip-172-30-0-37 kibana: {"type":"log","@timestamp":"2019-12-24T15:28:42Z","tags":["debug","legacy-service"],"pid":13484,"message":"Request will be handled by proxy POST:/api/monitoring/v1/clusters/vd63AAWCTc6rZ1RiIRSK4g/elasticsearch."}
Dec 24 15:28:42 ip-172-30-0-37 kibana: {"type":"error","@timestamp":"2019-12-24T15:28:42Z","tags":["error","monitoring"],"pid":13484,"level":"error","error":{"message":"Unable to find the cluster in the selected time range. UUID: vd63AAWCTc6rZ1RiIRSK4g","name":"Error","stack":"Error: Unable to find the cluster in the selected time range. UUID: vd63AAWCTc6rZ1RiIRSK4g\n    at then.clusters (/usr/share/kibana/x-pack/plugins/monitoring/server/lib/cluster/get_cluster_stats.js:31:15)\n    at process._tickCallback (internal/process/next_tick.js:68:7)"},"message":"Unable to find the cluster in the selected time range. UUID: vd63AAWCTc6rZ1RiIRSK4g"}
Dec 24 15:28:42 ip-172-30-0-37 kibana: {"type":"log","@timestamp":"2019-12-24T15:28:42Z","tags":["license","debug","xpack"],"pid":13484,"message":"Calling [data] Elasticsearch _xpack API. Polling frequency: 30001"}
Dec 24 15:28:42 ip-172-30-0-37 kibana: {"type":"response","@timestamp":"2019-12-24T15:28:42Z","tags":[],"pid":13484,"method":"post","statusCode":404,"req":{"url":"/api/monitoring/v1/clusters/vd63AAWCTc6rZ1RiIRSK4g/elasticsearch","method":"post","headers":{"host":"elasticsearch.gurushots.info:5601","connection":"keep-alive","content-length":"81","accept":"application/json, text/plain, */*","origin":"http://elasticsearch.gurushots.info:5601","kbn-version":"7.2.0","user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36","content-type":"application/json;charset=UTF-8","referer":"http://elasticsearch.gurushots.info:5601/app/monitoring","accept-encoding":"gzip, deflate","accept-language":"en-US,en;q=0.9,he;q=0.8"},"remoteAddress":"31.168.7.162","userAgent":"31.168.7.162","referer":"http://elasticsearch.gurushots.info:5601/app/monitoring"},"res":{"statusCode":404,"responseTime":487,"contentLength":9},"message":"POST /api/monitoring/v1/clusters/vd63AAWCTc6rZ1RiIRSK4g/elasticsearch 404 487ms - 9.0B"}
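Kibana log lines like the ones above are syslog-prefixed NDJSON, so they can be filtered down to just the monitoring errors with a small script. This is a hypothetical helper, not part of Kibana; the sample lines are abbreviated versions of the logs above:

```python
import json

def monitoring_errors(lines):
    """Return the messages of Kibana log entries tagged as monitoring errors."""
    errors = []
    for line in lines:
        # Strip any syslog prefix ("Dec 24 15:28:42 host kibana: ") by
        # keeping everything from the first '{' onward.
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except ValueError:
            continue  # not a JSON log line
        if entry.get("type") == "error" and "monitoring" in entry.get("tags", []):
            errors.append(entry["message"])
    return errors

sample = [
    'Dec 24 15:28:42 host kibana: {"type":"log","tags":["debug"],"message":"noise"}',
    'Dec 24 15:28:42 host kibana: {"type":"error","tags":["error","monitoring"],'
    '"message":"Unable to find the cluster in the selected time range. UUID: vd63AAWCTc6rZ1RiIRSK4g"}',
]
print(monitoring_errors(sample))
```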

Ah, what are your other xpack.monitoring.* settings in kibana.yml?

There are none :man_shrugging:t4:
Other than changing the ES timeout, I didn't touch any of the settings:

# Kibana is served by a back end server. This setting specifies the port to use.
#server.port: 5601

# Specifies the address to which the Kibana server will bind. IP addresses and host names are both valid values.
# The default is 'localhost', which usually means remote machines will not be able to connect.
# To allow connections from remote users, set this parameter to a non-loopback address.
server.host: 0.0.0.0

# Enables you to specify a path to mount Kibana at if you are running behind a proxy.
# Use the `server.rewriteBasePath` setting to tell Kibana if it should remove the basePath
# from requests it receives, and to prevent a deprecation warning at startup.
# This setting cannot end in a slash.
#server.basePath: ""

# Specifies whether Kibana should rewrite requests that are prefixed with
# `server.basePath` or require that they are rewritten by your reverse proxy.
# This setting was effectively always `false` before Kibana 6.3 and will
# default to `true` starting in Kibana 7.0.
#server.rewriteBasePath: false

# The maximum payload size in bytes for incoming server requests.
#server.maxPayloadBytes: 1048576

# The Kibana server's name.  This is used for display purposes.
server.name: "Kibana"

# The URLs of the Elasticsearch instances to use for all your queries.
#elasticsearch.hosts: ["http://localhost:9200"]

# When this setting's value is true Kibana uses the hostname specified in the server.host
# setting. When the value of this setting is false, Kibana uses the hostname of the host
# that connects to this Kibana instance.
#elasticsearch.preserveHost: true

# Kibana uses an index in Elasticsearch to store saved searches, visualizations and
# dashboards. Kibana creates a new index if the index doesn't already exist.
#kibana.index: ".kibana"

# The default application to load.
#kibana.defaultAppId: "home"

# If your Elasticsearch is protected with basic authentication, these settings provide
# the username and password that the Kibana server uses to perform maintenance on the Kibana
# index at startup. Your Kibana users still need to authenticate with Elasticsearch, which
# is proxied through the Kibana server.
#elasticsearch.username: "user"
#elasticsearch.password: "pass"

# Enables SSL and paths to the PEM-format SSL certificate and SSL key files, respectively.
# These settings enable SSL for outgoing requests from the Kibana server to the browser.
#server.ssl.enabled: false
#server.ssl.certificate: /path/to/your/server.crt
#server.ssl.key: /path/to/your/server.key

# Optional settings that provide the paths to the PEM-format SSL certificate and key files.
# These files validate that your Elasticsearch backend uses the same key files.
#elasticsearch.ssl.certificate: /path/to/your/client.crt
#elasticsearch.ssl.key: /path/to/your/client.key

# Optional setting that enables you to specify a path to the PEM file for the certificate
# authority for your Elasticsearch instance.
#elasticsearch.ssl.certificateAuthorities: [ "/path/to/your/CA.pem" ]

# To disregard the validity of SSL certificates, change this setting's value to 'none'.
#elasticsearch.ssl.verificationMode: full

# Time in milliseconds to wait for Elasticsearch to respond to pings. Defaults to the value of
# the elasticsearch.requestTimeout setting.
#elasticsearch.pingTimeout: 1500

# Time in milliseconds to wait for responses from the back end or Elasticsearch. This value
# must be a positive integer.
elasticsearch.requestTimeout: 60000

# List of Kibana client-side headers to send to Elasticsearch. To send *no* client-side
# headers, set this value to [] (an empty list).
#elasticsearch.requestHeadersWhitelist: [ authorization ]

# Header names and values that are sent to Elasticsearch. Any custom headers cannot be overwritten
# by client-side headers, regardless of the elasticsearch.requestHeadersWhitelist configuration.
#elasticsearch.customHeaders: {}

# Time in milliseconds for Elasticsearch to wait for responses from shards. Set to 0 to disable.
#elasticsearch.shardTimeout: 30000

# Time in milliseconds to wait for Elasticsearch at Kibana startup before retrying.
#elasticsearch.startupTimeout: 5000

# Logs queries sent to Elasticsearch. Requires logging.verbose set to true.
#elasticsearch.logQueries: false

# Specifies the path where Kibana creates the process ID file.
#pid.file: /var/run/kibana.pid

# Enables you to specify a file where Kibana stores log output.
#logging.dest: stdout

# Set the value of this setting to true to suppress all logging output.
#logging.silent: false

# Set the value of this setting to true to suppress all logging output other than error messages.
#logging.quiet: false

# Set the value of this setting to true to log all events, including system usage information
# and all requests.
#logging.verbose: false

# Set the interval in milliseconds to sample system and process performance
# metrics. Minimum is 100ms. Defaults to 5000.
#ops.interval: 5000

# Specifies locale to be used for all localizable strings, dates and number formats.
#i18n.locale: "en"


##debugging monitoring gaps
#xpack.monitoring.elasticsearch.logQueries: true
#logging.verbose: true

Ah, I think I know why the queries aren't logging.

I'm assuming your ES is running at the default address (http://localhost:9200) and if so, add this configuration:

xpack.monitoring.elasticsearch.hosts: ["http://localhost:9200"]

Then, comment xpack.monitoring.elasticsearch.logQueries: true back in and let me know if the queries show up.
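Putting those pieces together, the relevant kibana.yml fragment would look something like this (the host value is an assumption based on the default address):

```yaml
# Monitoring query logging requires the monitoring ES hosts to be set explicitly.
xpack.monitoring.elasticsearch.hosts: ["http://localhost:9200"]
xpack.monitoring.elasticsearch.logQueries: true
logging.verbose: true
```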

Thanks @chrisronline, it did work. I invoked the request caught in the Kibana logs and surprisingly it returned a valid response.

This is from the logs:

This is the request I invoked (taken from the logs) and the response:

Thanks for the assistance!

Hey @Barak,

The response to the manual request you ran in the last reply looks strange.

The formatted query looks like:

POST .monitoring-es-6-*,.monitoring-es-7-*/_search?size=10000&ignore_unavailable=true&filter_path=hits.hits._index,hits.hits._source.cluster_uuid,hits.hits._source.cluster_name,hits.hits._source.version,hits.hits._source.license.status,hits.hits._source.license.type,hits.hits._source.license.issue_date,hits.hits._source.license.expiry_date,hits.hits._source.license.expiry_date_in_millis,hits.hits._source.cluster_stats,hits.hits._source.cluster_state,hits.hits._source.cluster_settings.cluster.metadata.display_name
{
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "term": {
      "type": {
        "value": "cluster_stats"
      }
    }
  },
  "collapse": {
    "field": "cluster_uuid"
  }
}

which is very similar to an earlier query you ran, but the response looks very different.

The collapse in the body should ensure you only see a single hit per unique cluster_uuid, whereas your response looks like it contains the same one over and over.

Can you double check you ran the right query?
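For reference, here is a rough sketch of what Elasticsearch's field collapsing does: after sorting, only the first (top-sorted) hit per distinct field value is kept. This is an illustration of the semantics, not how ES implements it, and the hit data is made up:

```python
def collapse_hits(hits, field):
    """Keep only the first hit per distinct value of `field`.

    Hits are assumed to already be sorted, as in the monitoring query
    (timestamp descending), so the kept hit is the most recent one.
    """
    seen = set()
    collapsed = []
    for hit in hits:
        value = hit[field]
        if value in seen:
            continue  # a hit for this cluster_uuid was already kept
        seen.add(value)
        collapsed.append(hit)
    return collapsed

# Illustrative hits: two documents for the same cluster, one for another.
hits = [
    {"cluster_uuid": "vd63AAWC", "timestamp": "2019-12-24T15:28:42Z"},
    {"cluster_uuid": "vd63AAWC", "timestamp": "2019-12-24T15:28:32Z"},
    {"cluster_uuid": "otherUUID", "timestamp": "2019-12-24T15:28:22Z"},
]
print(collapse_hits(hits, "cluster_uuid"))
```

So a correct response to that query should never contain two hits with the same cluster_uuid, which is why the repeated hits look wrong.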

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.