Kibana monitoring gaps when data node is replaced

Hi,

I am using ES 7.2.0 on AWS spot instances.
The cluster consists of 5 master nodes, 8 data nodes, and 2 coordinating nodes.

I noticed that whenever a data node is replaced (due to spot replacement), there are monitoring gaps for the whole cluster.

During the gaps I ran GET .monitoring-es-7-*/_search and it returned valid data.
The gap usually starts when the node comes up and lasts until the EBS volume is warmed up and its read latency drops (usually ~20-30 minutes).

Note that the cluster is functioning properly during these gap periods; it is in yellow state, as a few shards are unassigned for that time.
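For anyone triaging similar gaps: the yellow status and unassigned-shard count come straight from the _cluster/health API. A minimal Python sketch that condenses such a response (the sample payload below is illustrative, not from this cluster):

```python
import json

def summarize_health(health: dict) -> str:
    """Condense a _cluster/health response into a one-line triage summary."""
    return (f"status={health['status']} "
            f"unassigned_shards={health['unassigned_shards']} "
            f"initializing_shards={health['initializing_shards']}")

# Illustrative payload shaped like GET _cluster/health (not from this cluster):
sample = json.loads(
    '{"status": "yellow", "unassigned_shards": 4, "initializing_shards": 1}'
)
print(summarize_health(sample))
# -> status=yellow unassigned_shards=4 initializing_shards=1
```

Watching this summary during a spot replacement makes it easy to correlate the monitoring gap with shard recovery.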

My question is: why is monitoring for the whole cluster affected by this one node, which may not be responding to the monitoring queries?

Thanks

Hi @Barak,

A couple of things we should double check:

  1. Are your master nodes exclusively master nodes?
  2. During this period of time, are there any logs indicating discovery (or other) issues on the master nodes?

Sorry for the late response @chrisronline.

  1. Yes, they are master-only.
  2. I found the logs below on the master node.

It looks like the new data node that joined (ip-172-30-0-16.ec2.internal) got disconnected a few times, possibly due to intensive I/O on its disk during the initial warm-up. What I find odd, however, is that this affects monitoring for the whole cluster instead of only for that specific node.

[2019-12-15T12:02:04,886][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] removed {{ip-172-30-2-69.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{jENiFLyySW6bymfiCgLvSg}{172.30.2.69}{172.30.2.69:9300}{aws_availability_zone=us-east-1c, ml.machine_memory=66010615808, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41094, reason: ApplyCommitRequest{term=37, version=41094, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:13:06,893][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] added {{ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41115, reason: ApplyCommitRequest{term=37, version=41115, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:27:16,545][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] removed {{ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41126, reason: ApplyCommitRequest{term=37, version=41126, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:27:16,546][WARN ][o.e.x.m.MonitoringService] [ip-172-30-1-6.ec2.internal] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
        at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:109) [x-pack-monitoring-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$1(LocalBulk.java:112) [x-pack-monitoring-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:50) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:74) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.transport.TransportService$8.run(TransportService.java:973) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.2.0.jar:7.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulk [default_local]
        ... 11 more
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [ip-172-30-0-16.ec2.internal][172.30.0.16:9300][indices:data/write/bulk] disconnected
[2019-12-15T12:27:29,357][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] added {{ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41138, reason: ApplyCommitRequest{term=37, version=41138, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:31:04,574][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] removed {{ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41146, reason: ApplyCommitRequest{term=37, version=41146, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:31:04,575][WARN ][o.e.x.m.MonitoringService] [ip-172-30-1-6.ec2.internal] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
        at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:109) [x-pack-monitoring-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$1(LocalBulk.java:112) [x-pack-monitoring-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:50) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:74) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.transport.TransportService$8.run(TransportService.java:973) [elasticsearch-7.2.0.jar:7.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.2.0.jar:7.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulk [default_local]
        ... 11 more
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [ip-172-30-0-16.ec2.internal][172.30.0.16:9300][indices:data/write/bulk] disconnected
[2019-12-15T12:33:20,780][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] added {{ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41161, reason: ApplyCommitRequest{term=37, version=41161, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:36:02,463][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-1-6.ec2.internal] removed {{ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true},}, term: 37, version: 41162, reason: ApplyCommitRequest{term=37, version=41162, sourceNode={ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}}
[2019-12-15T12:36:02,463][WARN ][o.e.x.m.MonitoringService] [ip-172-30-1-6.ec2.internal] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks

In addition, I can see the following on the data node.
The node says "master not discovered yet" and then lists all the master nodes.

After digging a bit I found a similar issue caused by loading global ordinals on large shards; however, as far as I understand, global ordinals are loaded only after a search containing aggregations, and our use case doesn't involve those.

[2019-12-16T11:51:32,570][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-2-197.ec2.internal] collector [node_stats] timed out when collecting data
[2019-12-16T11:51:36,709][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ip-172-30-2-197.ec2.internal] master not discovered yet: have discovered [{ip-172-30-1-6.ec2.internal}{Xd6Ex6OXRdac4pwZCxyTEQ}{yykEZM_BQaiyFoXTLYeF0A}{172.30.1.6}{172.30.1.6:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-2-48.ec2.internal}{FcGYX-cHSNSnszWd2xO0Rg}{c8_lsSoeTGu1FTCGOOlxmQ}{172.30.2.48}{172.30.2.48:9300}{aws_availability_zone=us-east-1c, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-54.ec2.internal}{qN0zCaZhTCGPGu849drP1Q}{3LTpzVViSICI6BbMRoyLow}{172.30.0.54}{172.30.0.54:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-248.ec2.internal}{EODGswyyT0WpM8FMUcIRSQ}{Cs83E-saTwmxLLblIiW_Qw}{172.30.0.248}{172.30.0.248:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 172.30.0.37:9300, 172.30.0.148:9300, 172.30.0.54:9300, 172.30.0.248:9300, 172.30.1.6:9300, 172.30.1.176:9300, 172.30.2.48:9300] from hosts providers and [{ip-172-30-1-6.ec2.internal}{Xd6Ex6OXRdac4pwZCxyTEQ}{yykEZM_BQaiyFoXTLYeF0A}{172.30.1.6}{172.30.1.6:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-15.ec2.internal}{1ogTu-_qQcCMCwLATVjDpg}{bneaD8Q0TEeVX7fAeI77WQ}{172.30.0.15}{172.30.0.15:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66010615808, 
ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-2-198.ec2.internal}{Pulg8N-8R0KfX5yUwiC_dQ}{yNSnzzDtRiqj50sJkyNxyQ}{172.30.2.198}{172.30.2.198:9300}{aws_availability_zone=us-east-1c, ml.machine_memory=133656322048, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-148.ec2.internal}{zJiqfiIZQIGtELtjY7jnxg}{T5-yHwR_ShafZognjIj7-w}{172.30.0.148}{172.30.0.148:9300}{ml.machine_memory=8362668032, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-217.ec2.internal}{MAgVF9bpRC6lAOwNxWNj7A}{bdauGFO9SeGunEVrnmwayw}{172.30.0.217}{172.30.0.217:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=133656322048, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-1-242.ec2.internal}{5SmrUGH4SRuLDYLdrOt77g}{G1v6wDcaR6iio_tr4ajLuw}{172.30.1.242}{172.30.1.242:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=133656322048, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-1-48.ec2.internal}{rcWaUYZTSUaAviidjIESRA}{N1xKewxMTkuta5wXSZivYg}{172.30.1.48}{172.30.1.48:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=66010615808, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-1-94.ec2.internal}{lYPBVwzbQhuprJql2Lamdg}{Ic2TAyPdR1GyCKLugB0S8Q}{172.30.1.94}{172.30.1.94:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=66010615808, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-1-176.ec2.internal}{yNxtuJaRQySfySaoimiJwQ}{TIsubmBgTIOD8nkx3mZNrQ}{172.30.1.176}{172.30.1.176:9300}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-2-48.ec2.internal}{FcGYX-cHSNSnszWd2xO0Rg}{c8_lsSoeTGu1FTCGOOlxmQ}{172.30.2.48}{172.30.2.48:9300}{aws_availability_zone=us-east-1c, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-37.ec2.internal}{If_ODG8WQnKKWLiU7PmRdw}{jjpe0Q7uQ9ib7G4sSKNDXw}{172.30.0.37}{172.30.0.37:9300}{ml.machine_memory=8362668032, ml.max_open_jobs=20, xpack.installed=true}, 
{ip-172-30-0-54.ec2.internal}{qN0zCaZhTCGPGu849drP1Q}{3LTpzVViSICI6BbMRoyLow}{172.30.0.54}{172.30.0.54:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-2-91.ec2.internal}{hkMoT0WES-C2F-OnUnc-_A}{CfbGhwBoTCaWqjkQjVVDfw}{172.30.2.91}{172.30.2.91:9300}{aws_availability_zone=us-east-1c, ml.machine_memory=66010615808, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-2-197.ec2.internal}{e0bwfaScSHmflJ1MSqXdxA}{dLHLLY6zSfC9Hi9UST9-dA}{172.30.2.197}{172.30.2.197:9300}{aws_availability_zone=us-east-1c, ml.machine_memory=67534430208, xpack.installed=true, ml.max_open_jobs=20}, {ip-172-30-0-248.ec2.internal}{EODGswyyT0WpM8FMUcIRSQ}{Cs83E-saTwmxLLblIiW_Qw}{172.30.0.248}{172.30.0.248:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true}, {ip-172-30-0-16.ec2.internal}{1N90K53yQRiEZer2Jqo4dg}{FaXvQ29tSAGIXooNLgp0wg}{172.30.0.16}{172.30.0.16:9300}{aws_availability_zone=us-east-1a, ml.machine_memory=66715250688, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 37, last-accepted version 42511 in term 37
[2019-12-16T11:51:44,543][ERROR][o.e.x.m.c.n.NodeStatsCollector] [ip-172-30-2-197.ec2.internal] collector [node_stats] timed out when collecting data

Hmm. Okay, let's figure this out.

Let's run the query to fetch the Search Rate graph for one of these black-out periods and see what the data is telling us:

Fill in <cluster_uuid> with the right cluster UUID, then adjust the time range to the affected period.

POST .monitoring-es-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "cluster_uuid": "<cluster_uuid>"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "2019-12-16T00:51:07.080Z",
              "lte": "2019-12-16T18:51:07.080Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "check": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "30s"
      },
      "aggs": {
        "metric": {
          "max": {
            "field": "indices_stats._all.total.search.query_total"
          }
        },
        "metric_deriv": {
          "derivative": {
            "buckets_path": "metric",
            "gap_policy": "skip",
            "unit": "1s"
          }
        }
      }
    }
  }
}

Let's see if this helps us. Thanks!

So it appears the data exists in the monitoring indices; it's just not displayed in Kibana.
If I try to zoom in on the blank timeframe in Kibana I get "Monitoring Request Failed. Unable to find the cluster in the selected time range. UUID: vd63AAWCTc6rZ1RiIRSK4g. HTTP 404"

Here's a screenshot of the timeframe:

And here's the query and response
Query:

{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "cluster_uuid": "vd63AAWCTc6rZ1RiIRSK4g"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "2019-12-17T03:48:00.080Z",
              "lte": "2019-12-17T03:49:00.080Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "check": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "30s"
      },
      "aggs": {
        "metric": {
          "max": {
            "field": "indices_stats._all.total.search.query_total"
          }
        },
        "metric_deriv": {
          "derivative": {
            "buckets_path": "metric",
            "gap_policy": "skip",
            "unit": "1s"
          }
        }
      }
    }
  }
}

Response (truncated):

    {
        "took": 10,
        "timed_out": false,
        "_shards": {
            "total": 7,
            "successful": 7,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 222,
                "relation": "eq"
            },
            "max_score": 0.0,
            "hits": [
                {
                    "_index": ".monitoring-es-7-2019.12.17",
                    "_type": "_doc",
                    "_id": "FcP4EW8Byx4LD79xQ0Os",
                    "_score": 0.0,
                    "_source": {
                        "cluster_uuid": "vd63AAWCTc6rZ1RiIRSK4g",
                        "timestamp": "2019-12-17T03:48:00.543Z",
                        "interval_ms": 10000,
                        "type": "node_stats",
                        "source_node": {
                            "uuid": "5SmrUGH4SRuLDYLdrOt77g",
                            "host": "172.30.1.242",
                            "transport_address": "172.30.1.242:9300",
                            "ip": "172.30.1.242",
                            "name": "ip-172-30-1-242.ec2.internal",
                            "timestamp": "2019-12-17T03:48:00.543Z"
                        },
                        "node_stats": {
                            "node_id": "5SmrUGH4SRuLDYLdrOt77g",
                            "node_master": false,
                            "mlockall": false,
                            "indices": {
                                "docs": {
                                    "count": 26609220
                                },
                                "store": {
                                    "size_in_bytes": 49286464235
                                },
                                "indexing": {
                                    "index_total": 3303155,
                                    "index_time_in_millis": 1646136,
                                    "throttle_time_in_millis": 0
                                },
                                "search": {
                                    "query_total": 142097245,
                                    "query_time_in_millis": 855398284
                                },
                                "query_cache": {
                                    "memory_size_in_bytes": 32790152,
                                    "hit_count": 227876976,
                                    "miss_count": 249589745,
                                    "evictions": 20350234
                                },
                                "fielddata": {
                                    "memory_size_in_bytes": 0,
                                    "evictions": 0
                                },
                                "segments": {
                                    "count": 88,
                                    "memory_in_bytes": 11952968186,
                                    "terms_memory_in_bytes": 11943687306,
                                    "stored_fields_memory_in_bytes": 7562824,
                                    "term_vectors_memory_in_bytes": 0,
                                    "norms_memory_in_bytes": 54656,
                                    "points_memory_in_bytes": 978040,
                                    "doc_values_memory_in_bytes": 685360,
                                    "index_writer_memory_in_bytes": 849632,
                                    "version_map_memory_in_bytes": 748,
                                    "fixed_bit_set_memory_in_bytes": 132592
                                },
                                "request_cache": {
                                    "memory_size_in_bytes": 0,
                                    "evictions": 0,
                                    "hit_count": 30,
                                    "miss_count": 11390
                                }
                            },
                            "os": {
                                "cpu": {
                                    "load_average": {
                                        "1m": 0.2,
                                        "5m": 0.2,
                                        "15m": 1.11
                                    }
                                },
                                "cgroup": {
                                    "cpuacct": {
                                        "control_group": "/",
                                        "usage_nanos": 1407906507480565
                                    },
                                    "cpu": {
                                        "control_group": "/",
                                        "cfs_period_micros": 100000,
                                        "cfs_quota_micros": -1,
                                        "stat": {
                                            "number_of_elapsed_periods": 0,
                                            "number_of_times_throttled": 0,
                                            "time_throttled_nanos": 0
                                        }
                                    },
                                    "memory": {
                                        "control_group": "/",
                                        "limit_in_bytes": "9223372036854771712",
                                        "usage_in_bytes": "98930262016"
                                    }
                                }
                            },
                            "process": {
                                "open_file_descriptors": 898,
                                "max_file_descriptors": 65535,
                                "cpu": {
                                    "percent": 1
                                }
                            },
                            "jvm": {
                                "mem": {
                                    "heap_used_in_bytes": 19752021000,
                                    "heap_used_percent": 57,
                                    "heap_max_in_bytes": 34246361088
                                },
                                "gc": {
                                    "collectors": {
                                        "young": {
                                            "collection_count": 1582085,
                                            "collection_time_in_millis": 36134761
                                        },
                                        "old": {
                                            "collection_count": 33,
                                            "collection_time_in_millis": 6668
                                        }
                                    }
                                }
                            },
                            "thread_pool": {
                                "generic": {
                                    "threads": 20,
                                    "queue": 0,
                                    "rejected": 0
                                },
                                "get": {
                                    "threads": 1,
                                    "queue": 0,
                                    "rejected": 0
                                },
                                "management": {
                                    "threads": 5,
                                    "queue": 0,
                                    "rejected": 0
                                },
                                "search": {
                                    "threads": 25,
                                    "queue": 0,
                                    "rejected": 0
                                },
                                "watcher": {
                                    "threads": 0,
                                    "queue": 0,
                                    "rejected": 0
                                },
                                "write": {
                                    "threads": 16,
                                    "queue": 0,
                                    "rejected": 0
                                }
                            },
                            "fs": {
                                "total": {
                                    "total_in_bytes": 1056750854144,
                                    "free_in_bytes": 995028893696,
                                    "available_in_bytes": 951978315776
                                },
                                "io_stats": {
                                    "total": {
                                        "operations": 11420617,
                                        "read_operations": 444892,
                                        "write_operations": 10975725,
                                        "read_kilobytes": 16952040,
                                        "write_kilobytes": 290891716
                                    }
                                }
                            }
                        }
                    }
                },
                {
                    "_index": ".monitoring-es-7-2019.12.17",
                    "_type": "_doc",
                    "_id": "8x34EW8BQZsPnUc7RfgG",
                    "_score": 0.0,
                    "_source": {
                        "cluster_uuid": "vd63AAWCTc6rZ1RiIRSK4g",
                        "timestamp": "2019-12-17T03:48:00.875Z",


.
.
.
    "aggregations": {
        "check": {
            "buckets": [
                {
                    "key_as_string": "2019-12-17T03:48:00.000Z",
                    "key": 1576554480000,
                    "doc_count": 111,
                    "metric": {
                        "value": 3.96591825E8
                    }
                },
                {
                    "key_as_string": "2019-12-17T03:48:30.000Z",
                    "key": 1576554510000,
                    "doc_count": 111,
                    "metric": {
                        "value": 3.966284E8
                    },
                    "metric_deriv": {
                        "value": 36575.0,
                        "normalized_value": 1219.1666666666667
                    }
                }
            ]
        }
    }
}
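As a sanity check, the metric_deriv value above is just the per-second difference of consecutive query_total maxima. A minimal Python sketch using the two bucket values from the response:

```python
# Two consecutive 30s buckets from the aggregation response above:
# max(query_total) per bucket, a cumulative counter.
prev_total = 396_591_825   # 3.96591825E8 at 03:48:00
curr_total = 396_628_400   # 3.966284E8  at 03:48:30

def search_rate_per_sec(prev: int, curr: int, interval_s: int = 30) -> float:
    """Per-second derivative of the cumulative query_total counter."""
    return (curr - prev) / interval_s

print(search_rate_per_sec(prev_total, curr_total))
# -> 1219.1666666666667, matching metric_deriv.normalized_value
```

So the raw numbers reconstruct exactly what the Search Rate graph should be plotting for those buckets.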

Thanks for that!

The data is there, but I wonder if other data is missing that would cause the UI to error out like that.

Let's try another query:

POST .monitoring-es-*/_search
{
  "size": 1000,
  "sort": {
    "timestamp": {
      "order": "desc"
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "type": "cluster_stats"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": "2019-12-17T03:48:00.080Z",
              "lte": "2019-12-17T03:49:00.080Z"
            }
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "cluster_uuid"
  }
}
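For intuition, the collapse clause here keeps a single top-sorted hit per cluster_uuid. A rough Python equivalent of that dedupe (the sample docs are illustrative, not real monitoring documents):

```python
def collapse_by_cluster(hits):
    """Keep only the first hit per cluster_uuid after the desc sort,
    mirroring what `collapse` does in the query above."""
    seen, out = set(), []
    for hit in sorted(hits, key=lambda h: h["timestamp"], reverse=True):
        if hit["cluster_uuid"] not in seen:
            seen.add(hit["cluster_uuid"])
            out.append(hit)
    return out

# Illustrative docs (not real monitoring documents):
docs = [
    {"cluster_uuid": "A", "timestamp": "2019-12-17T03:48:10Z"},
    {"cluster_uuid": "A", "timestamp": "2019-12-17T03:48:40Z"},
    {"cluster_uuid": "B", "timestamp": "2019-12-17T03:48:20Z"},
]
print([(d["cluster_uuid"], d["timestamp"]) for d in collapse_by_cluster(docs)])
# -> [('A', '2019-12-17T03:48:40Z'), ('B', '2019-12-17T03:48:20Z')]
```

In other words, this query answers "what is the newest cluster_stats document per cluster in that window", which is roughly what the UI needs to resolve the cluster.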

This one also returned a valid response:

Thanks for that. I'm still not exactly sure.

Can you capture and upload a HAR file? Make sure you start capturing after selecting the time period that causes the issue. The capture doesn't need to be long - we just need a single set of XHR requests to the Kibana server.

See https://community.box.com/t5/Managing-Content-Troubleshooting/How-to-Generate-Network-Captures-for-Troubleshooting/ta-p/366#toc-hId--671775661

Thanks @chrisronline.
The HAR file is here: https://transfernow.net/ddl/elastic

In the responses you can see many null values; these were generated when I selected a timeframe that contains a gap but also has some data before and after it.
The 404 was generated when selecting only an empty timeframe.

Yeah, this is interesting. There seems to be data in the affected time period (rather than an actual absence of data), but I don't quite know why the UI isn't showing anything.

I don't have specific next steps, but I have an idea of how we can debug this further.

Let's turn on query logging for monitoring and get a list of the queries that are executed while only the affected time window is selected (don't bother expanding the time period to include "valid" data points).

You'll need two settings in kibana.yml to do this:

xpack.monitoring.elasticsearch.logQueries: true
logging.verbose: true

Collect those queries, then execute each one in the Kibana Dev Tools console and post the query and response for each.

We'll figure this out!

I didn't see the POST data in the logs. This is what was generated during the request for the empty timeframe:

Dec 24 15:28:42 ip-172-30-0-37 kibana: {"type":"log","@timestamp":"2019-12-24T15:28:42Z","tags":["debug","legacy-proxy"],"pid":13484,"message":"Event is being forwarded: connection"}
Dec 24 15:28:42 ip-172-30-0-37 kibana: {"type":"log","@timestamp":"2019-12-24T15:28:42Z","tags":["debug","legacy-service"],"pid":13484,"message":"Request will be handled by proxy POST:/api/monitoring/v1/clusters/vd63AAWCTc6rZ1RiIRSK4g/elasticsearch."}
Dec 24 15:28:42 ip-172-30-0-37 kibana: {"type":"error","@timestamp":"2019-12-24T15:28:42Z","tags":["error","monitoring"],"pid":13484,"level":"error","error":{"message":"Unable to find the cluster in the selected time range. UUID: vd63AAWCTc6rZ1RiIRSK4g","name":"Error","stack":"Error: Unable to find the cluster in the selected time range. UUID: vd63AAWCTc6rZ1RiIRSK4g\n    at then.clusters (/usr/share/kibana/x-pack/plugins/monitoring/server/lib/cluster/get_cluster_stats.js:31:15)\n    at process._tickCallback (internal/process/next_tick.js:68:7)"},"message":"Unable to find the cluster in the selected time range. UUID: vd63AAWCTc6rZ1RiIRSK4g"}
Dec 24 15:28:42 ip-172-30-0-37 kibana: {"type":"log","@timestamp":"2019-12-24T15:28:42Z","tags":["license","debug","xpack"],"pid":13484,"message":"Calling [data] Elasticsearch _xpack API. Polling frequency: 30001"}
Dec 24 15:28:42 ip-172-30-0-37 kibana: {"type":"response","@timestamp":"2019-12-24T15:28:42Z","tags":[],"pid":13484,"method":"post","statusCode":404,"req":{"url":"/api/monitoring/v1/clusters/vd63AAWCTc6rZ1RiIRSK4g/elasticsearch","method":"post","headers":{"host":"elasticsearch.gurushots.info:5601","connection":"keep-alive","content-length":"81","accept":"application/json, text/plain, */*","origin":"http://elasticsearch.gurushots.info:5601","kbn-version":"7.2.0","user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36","content-type":"application/json;charset=UTF-8","referer":"http://elasticsearch.gurushots.info:5601/app/monitoring","accept-encoding":"gzip, deflate","accept-language":"en-US,en;q=0.9,he;q=0.8"},"remoteAddress":"31.168.7.162","userAgent":"31.168.7.162","referer":"http://elasticsearch.gurushots.info:5601/app/monitoring"},"res":{"statusCode":404,"responseTime":487,"contentLength":9},"message":"POST /api/monitoring/v1/clusters/vd63AAWCTc6rZ1RiIRSK4g/elasticsearch 404 487ms - 9.0B"}
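Kibana log lines like the ones above are syslog-prefixed NDJSON, so they can be filtered down to just the monitoring errors with a small script. This is a hypothetical helper, not part of Kibana; the sample lines are abbreviated versions of the logs above:

```python
import json

def monitoring_errors(lines):
    """Return the messages of Kibana log entries tagged as monitoring errors."""
    errors = []
    for line in lines:
        # Strip any syslog prefix ("Dec 24 15:28:42 host kibana: ") by
        # keeping everything from the first '{' onward.
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except ValueError:
            continue  # not a JSON log line
        if entry.get("type") == "error" and "monitoring" in entry.get("tags", []):
            errors.append(entry["message"])
    return errors

sample = [
    'Dec 24 15:28:42 host kibana: {"type":"log","tags":["debug"],"message":"noise"}',
    'Dec 24 15:28:42 host kibana: {"type":"error","tags":["error","monitoring"],'
    '"message":"Unable to find the cluster in the selected time range. UUID: vd63AAWCTc6rZ1RiIRSK4g"}',
]
print(monitoring_errors(sample))
```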

Ah, what are your other xpack.monitoring.* settings in kibana.yml?

There are none :man_shrugging:t4:
Other than changing the ES timeout, I didn't touch any of the settings:

# Kibana is served by a back end server. This setting specifies the port to use.
#server.port: 5601

# Specifies the address to which the Kibana server will bind. IP addresses and host names are both valid values.
# The default is 'localhost', which usually means remote machines will not be able to connect.
# To allow connections from remote users, set this parameter to a non-loopback address.
server.host: 0.0.0.0

# Enables you to specify a path to mount Kibana at if you are running behind a proxy.
# Use the `server.rewriteBasePath` setting to tell Kibana if it should remove the basePath
# from requests it receives, and to prevent a deprecation warning at startup.
# This setting cannot end in a slash.
#server.basePath: ""

# Specifies whether Kibana should rewrite requests that are prefixed with
# `server.basePath` or require that they are rewritten by your reverse proxy.
# This setting was effectively always `false` before Kibana 6.3 and will
# default to `true` starting in Kibana 7.0.
#server.rewriteBasePath: false

# The maximum payload size in bytes for incoming server requests.
#server.maxPayloadBytes: 1048576

# The Kibana server's name.  This is used for display purposes.
server.name: "Kibana"

# The URLs of the Elasticsearch instances to use for all your queries.
#elasticsearch.hosts: ["http://localhost:9200"]

# When this setting's value is true Kibana uses the hostname specified in the server.host
# setting. When the value of this setting is false, Kibana uses the hostname of the host
# that connects to this Kibana instance.
#elasticsearch.preserveHost: true

# Kibana uses an index in Elasticsearch to store saved searches, visualizations and
# dashboards. Kibana creates a new index if the index doesn't already exist.
#kibana.index: ".kibana"

# The default application to load.
#kibana.defaultAppId: "home"

# If your Elasticsearch is protected with basic authentication, these settings provide
# the username and password that the Kibana server uses to perform maintenance on the Kibana
# index at startup. Your Kibana users still need to authenticate with Elasticsearch, which
# is proxied through the Kibana server.
#elasticsearch.username: "user"
#elasticsearch.password: "pass"

# Enables SSL and paths to the PEM-format SSL certificate and SSL key files, respectively.
# These settings enable SSL for outgoing requests from the Kibana server to the browser.
#server.ssl.enabled: false
#server.ssl.certificate: /path/to/your/server.crt
#server.ssl.key: /path/to/your/server.key

# Optional settings that provide the paths to the PEM-format SSL certificate and key files.
# These files validate that your Elasticsearch backend uses the same key files.
#elasticsearch.ssl.certificate: /path/to/your/client.crt
#elasticsearch.ssl.key: /path/to/your/client.key

# Optional setting that enables you to specify a path to the PEM file for the certificate
# authority for your Elasticsearch instance.
#elasticsearch.ssl.certificateAuthorities: [ "/path/to/your/CA.pem" ]

# To disregard the validity of SSL certificates, change this setting's value to 'none'.
#elasticsearch.ssl.verificationMode: full

# Time in milliseconds to wait for Elasticsearch to respond to pings. Defaults to the value of
# the elasticsearch.requestTimeout setting.
#elasticsearch.pingTimeout: 1500

# Time in milliseconds to wait for responses from the back end or Elasticsearch. This value
# must be a positive integer.
elasticsearch.requestTimeout: 60000

# List of Kibana client-side headers to send to Elasticsearch. To send *no* client-side
# headers, set this value to [] (an empty list).
#elasticsearch.requestHeadersWhitelist: [ authorization ]

# Header names and values that are sent to Elasticsearch. Any custom headers cannot be overwritten
# by client-side headers, regardless of the elasticsearch.requestHeadersWhitelist configuration.
#elasticsearch.customHeaders: {}

# Time in milliseconds for Elasticsearch to wait for responses from shards. Set to 0 to disable.
#elasticsearch.shardTimeout: 30000

# Time in milliseconds to wait for Elasticsearch at Kibana startup before retrying.
#elasticsearch.startupTimeout: 5000

# Logs queries sent to Elasticsearch. Requires logging.verbose set to true.
#elasticsearch.logQueries: false

# Specifies the path where Kibana creates the process ID file.
#pid.file: /var/run/kibana.pid

# Enables you to specify a file where Kibana stores log output.
#logging.dest: stdout

# Set the value of this setting to true to suppress all logging output.
#logging.silent: false

# Set the value of this setting to true to suppress all logging output other than error messages.
#logging.quiet: false

# Set the value of this setting to true to log all events, including system usage information
# and all requests.
#logging.verbose: false

# Set the interval in milliseconds to sample system and process performance
# metrics. Minimum is 100ms. Defaults to 5000.
#ops.interval: 5000

# Specifies locale to be used for all localizable strings, dates and number formats.
#i18n.locale: "en"


##debugging monitoring gaps
#xpack.monitoring.elasticsearch.logQueries: true
#logging.verbose: true

Ah, I think I know why the queries aren't logging.

I'm assuming your ES is running at the default address (http://localhost:9200) and if so, add this configuration:

xpack.monitoring.elasticsearch.hosts: ["http://localhost:9200"]

Then, comment xpack.monitoring.elasticsearch.logQueries: true back in and let me know if the queries show up.
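Putting those pieces together, the relevant kibana.yml fragment would look something like this (the host value is an assumption based on the default address):

```yaml
# Monitoring query logging requires the monitoring ES hosts to be set explicitly.
xpack.monitoring.elasticsearch.hosts: ["http://localhost:9200"]
xpack.monitoring.elasticsearch.logQueries: true
logging.verbose: true
```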

Thanks @chrisronline, it did work. I invoked the request caught in the Kibana logs and surprisingly it returned a valid response.

This is from the logs:

This is the request I invoked (taken from the logs) and the response:

Thanks for the assistance!

Hey @Barak,

The response to the manual request you ran in the last reply looks strange.

The formatted query looks like:

POST .monitoring-es-6-*,.monitoring-es-7-*/_search?size=10000&ignore_unavailable=true&filter_path=hits.hits._index,hits.hits._source.cluster_uuid,hits.hits._source.cluster_name,hits.hits._source.version,hits.hits._source.license.status,hits.hits._source.license.type,hits.hits._source.license.issue_date,hits.hits._source.license.expiry_date,hits.hits._source.license.expiry_date_in_millis,hits.hits._source.cluster_stats,hits.hits._source.cluster_state,hits.hits._source.cluster_settings.cluster.metadata.display_name
{
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "term": {
      "type": {
        "value": "cluster_stats"
      }
    }
  },
  "collapse": {
    "field": "cluster_uuid"
  }
}

which is very similar to an earlier query you ran, but the response looks very different.

The collapse in the body should ensure you only see a single hit per unique cluster_uuid, whereas your response looks like it contains the same one over and over.

Can you double check you ran the right query?
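For reference, here is a rough sketch of what Elasticsearch's field collapsing does: after sorting, only the first (top-sorted) hit per distinct field value is kept. This is an illustration of the semantics, not how ES implements it, and the hit data is made up:

```python
def collapse_hits(hits, field):
    """Keep only the first hit per distinct value of `field`.

    Hits are assumed to already be sorted, as in the monitoring query
    (timestamp descending), so the kept hit is the most recent one.
    """
    seen = set()
    collapsed = []
    for hit in hits:
        value = hit[field]
        if value in seen:
            continue  # a hit for this cluster_uuid was already kept
        seen.add(value)
        collapsed.append(hit)
    return collapsed

# Illustrative hits: two documents for the same cluster, one for another.
hits = [
    {"cluster_uuid": "vd63AAWC", "timestamp": "2019-12-24T15:28:42Z"},
    {"cluster_uuid": "vd63AAWC", "timestamp": "2019-12-24T15:28:32Z"},
    {"cluster_uuid": "otherUUID", "timestamp": "2019-12-24T15:28:22Z"},
]
print(collapse_hits(hits, "cluster_uuid"))
```

So a correct response to that query should never contain two hits with the same cluster_uuid, which is why the repeated hits look wrong.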

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.