Monitoring interface unable to find the cluster

monitoring

(Jon) #1

We are currently running ELK 6.4.2 with OpenJDK 1.8.0-191 on CentOS 7.5. This cluster contains 30 billion docs across 919 indices and 9,820 shards, totalling 90 TB of data.

We were previously running 6.2.4, and one day while we were upgrading memory on the ES nodes something went sideways with X-Pack monitoring. The monitoring indices are still being written to as expected, but the interface complains that it cannot find the cluster. After some troubleshooting we were unable to get it working again, but since we were already planning to upgrade to 6.4.x we figured we would wait until then to really dig deeper. Fast forward: we have completed upgrading all our clusters to 6.4.2, and the monitoring page on this particular cluster is still broken.

The error we see is:
"Monitoring Request Failed
Unable to find the cluster in the selected time range. UUID: pKuY7ygvSGGF8iAR-rrVQA
HTTP 404"

I have looked through the data in the .monitoring-* indices; every document has a cluster_uuid field containing "pKuY7ygvSGGF8iAR-rrVQA".

The data in the .monitoring-* indices is right up to date and I can browse it in Kibana just fine, so it seems there is some kind of mismatch between the data and what the monitoring page expects to find. The error comes up almost immediately; there is no delay as if something were timing out. The monitoring indices have 3 primary shards plus one replica each. Increasing the time range past 1 hour still does not return any results.
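
As a sanity check, the UUID in the error can be compared against what the cluster reports about itself, since the root endpoint includes a cluster_uuid field:

GET /

The cluster_uuid in that response matches the value in the error message and in the monitoring documents.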

Any help or tips on what to look for would be appreciated.


(Shaunak Kashyap) #2

Would you mind running the following ES query and posting the results here?

POST .monitoring-es-6-*/_search
{
  "size": 0,
  "aggs": {
    "type": {
      "terms": {
        "field": "type",
        "size": 20
      },
      "aggs": {
        "cluster": {
          "terms": {
            "field": "cluster_uuid",
            "size": 5
          }
        }
      }
    }
  }
}

(Jon) #3

Here is the output of the requested query:

{
  "took": 175,
  "timed_out": false,
  "_shards": {
    "total": 21,
    "successful": 21,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 32675226,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "index_stats",
          "doc_count": 17200882,
          "cluster": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "pKuY7ygvSGGF8iAR-rrVQA",
                "doc_count": 17200882
              }
            ]
          }
        },
        {
          "key": "shards",
          "doc_count": 15059296,
          "cluster": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "pKuY7ygvSGGF8iAR-rrVQA",
                "doc_count": 15059296
              }
            ]
          }
        },
        {
          "key": "node_stats",
          "doc_count": 369505,
          "cluster": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "pKuY7ygvSGGF8iAR-rrVQA",
                "doc_count": 369505
              }
            ]
          }
        },
        {
          "key": "index_recovery",
          "doc_count": 24787,
          "cluster": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "pKuY7ygvSGGF8iAR-rrVQA",
                "doc_count": 24787
              }
            ]
          }
        },
        {
          "key": "indices_stats",
          "doc_count": 19595,
          "cluster": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "pKuY7ygvSGGF8iAR-rrVQA",
                "doc_count": 19595
              }
            ]
          }
        },
        {
          "key": "cluster_stats",
          "doc_count": 1161,
          "cluster": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "pKuY7ygvSGGF8iAR-rrVQA",
                "doc_count": 1161
              }
            ]
          }
        }
      ]
    }
  }
}

Thanks


(Shaunak Kashyap) #4

Thanks, all of those look normal. Could you tweak that query a bit as shown below and post the results, please?

POST .monitoring-es-6-*/_search
{
  "size": 0,
  "aggs": {
    "type": {
      "terms": {
        "field": "type",
        "size": 20
      },
      "aggs": {
        "cluster": {
          "terms": {
            "field": "cluster_uuid",
            "size": 5
          },
          "aggs": {
            "latest_doc": {
              "top_hits": {
                "size": 1,
                "_source": "timestamp",
                "sort": [
                  {
                    "timestamp": {
                      "order": "desc"
                    }
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}

Also, could you post a screenshot of your Kibana Monitoring Cluster Overview page as well as copy-paste the URL from the browser window while you're on that page?

Thanks.


(Jon) #5

Screenshot of monitoring page:

URL:
https://kibana.domain.com/app/monitoring#/no-data?_g=h@44136fa

The query output is too large for one post, so I will break it up.


(Jon) #6

Query output part 1

{
  "took": 198,
  "timed_out": false,
  "_shards": {
    "total": 21,
    "successful": 21,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 32723959,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "index_stats",
          "doc_count": 17248722,
          "cluster": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "pKuY7ygvSGGF8iAR-rrVQA",
                "doc_count": 17248722,
                "latest_doc": {
                  "hits": {
                    "total": 17248722,
                    "max_score": null,
                    "hits": [
                      {
                        "_index": ".monitoring-es-6-2018.11.14",
                        "_type": "doc",
                        "_id": "A_-zE2cBZ67Ykr9K9xAl",
                        "_score": null,
                        "_source": {
                          "timestamp": "2018-11-14T19:30:18.915Z"
                        },
                        "sort": [
                          1542223818915
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        },
        {
          "key": "shards",
          "doc_count": 15059296,
          "cluster": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "pKuY7ygvSGGF8iAR-rrVQA",
                "doc_count": 15059296,
                "latest_doc": {
                  "hits": {
                    "total": 15059296,
                    "max_score": null,
                    "hits": [
                      {
                        "_index": ".monitoring-es-6-2018.11.14",
                        "_type": "doc",
                        "_id": "-PIaNHW1RJyPqs4TbdHOYA:DzcEcbBISHSQP36zu582Bw:logstash-2018.02.17:3:r",
                        "_score": null,
                        "_source": {
                          "timestamp": "2018-11-14T19:30:19.754Z"
                        },
                        "sort": [
                          1542223819754
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        },
        {
          "key": "node_stats",
          "doc_count": 370294,
          "cluster": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "pKuY7ygvSGGF8iAR-rrVQA",
                "doc_count": 370294,
                "latest_doc": {
                  "hits": {
                    "total": 370294,
                    "max_score": null,
                    "hits": [
                      {
                        "_index": ".monitoring-es-6-2018.11.14",
                        "_type": "doc",
                        "_id": "9YO0E2cBJ06QRqoMB-9c",
                        "_score": null,
                        "_source": {
                          "timestamp": "2018-11-14T19:30:24.525Z"
                        },
                        "sort": [
                          1542223824525
                        ]
                      }
                    ]
                  }
                }
              }
            ]
          }
        },

(Jon) #7

Part 2:

{
  "key": "index_recovery",
  "doc_count": 24839,
  "cluster": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "pKuY7ygvSGGF8iAR-rrVQA",
        "doc_count": 24839,
        "latest_doc": {
          "hits": {
            "total": 24839,
            "max_score": null,
            "hits": [
              {
                "_index": ".monitoring-es-6-2018.11.14",
                "_type": "doc",
                "_id": "nP-zE2cBZ67Ykr9K9xMl",
                "_score": null,
                "_source": {
                  "timestamp": "2018-11-14T19:30:19.413Z"
                },
                "sort": [
                  1542223819413
                ]
              }
            ]
          }
        }
      }
    ]
  }
},
{
  "key": "indices_stats",
  "doc_count": 19647,
  "cluster": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "pKuY7ygvSGGF8iAR-rrVQA",
        "doc_count": 19647,
        "latest_doc": {
          "hits": {
            "total": 19647,
            "max_score": null,
            "hits": [
              {
                "_index": ".monitoring-es-6-2018.11.14",
                "_type": "doc",
                "_id": "m_-zE2cBZ67Ykr9K9xMl",
                "_score": null,
                "_source": {
                  "timestamp": "2018-11-14T19:30:18.915Z"
                },
                "sort": [
                  1542223818915
                ]
              }
            ]
          }
        }
      }
    ]
  }
},
{
  "key": "cluster_stats",
  "doc_count": 1161,
  "cluster": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "pKuY7ygvSGGF8iAR-rrVQA",
        "doc_count": 1161,
        "latest_doc": {
          "hits": {
            "total": 1161,
            "max_score": null,
            "hits": [
              {
                "_index": ".monitoring-es-6-2018.11.11",
                "_type": "doc",
                "_id": "IVGpBGcBZ67Ykr9KXP2d",
                "_score": null,
                "_source": {
                  "timestamp": "2018-11-11T21:24:23.373Z"
                },
                "sort": [
                  1541971463373
                ]
              }
            ]
          }
        }
      }
    ]
  }
}
  ]
}
  }
}

(Chris Roberson) #8

Hi @alaphoid,

Are you using a dedicated monitoring cluster or are you using your production cluster as the monitoring cluster?

More specifically, is there a config for xpack.monitoring.elasticsearch.url in your kibana.yml?
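
For reference, a dedicated-monitoring-cluster setup would have something like this in kibana.yml (the host below is just a placeholder):

xpack.monitoring.elasticsearch.url: "https://monitoring-cluster.example.com:9200"

When that setting is absent, Kibana reads monitoring data from the same cluster it uses for everything else.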


(Shaunak Kashyap) #9

Thanks for the query output. I see the problem there: most of the document types in the monitoring indices are up to date, with latest timestamps from 2018-11-14, but cluster_stats is not: its latest document is from 2018-11-11. That type holds the data needed to generate the opening pages of the Monitoring UI in Kibana.

Are you seeing any errors in your Elasticsearch logs, especially any that mention something about a cluster stats collector?

Thanks.


(Jon) #10

We are not using a dedicated monitoring cluster, and we do not have that config line in kibana.yml.


(Shaunak Kashyap) #11

@alaphoid, when you look in these logs, you might want to look around the timestamp of 2018-11-11T21:24:23.373Z.


(Jon) #13

That would have been around the time this cluster was upgraded from 6.2.4 to 6.4.2; I will see if I can find anything useful.


(Jon) #14

After the cluster finished allocating shards post upgrade there are no messages in the Elasticsearch logs other than garbage collection INFO messages.


(Shaunak Kashyap) #15

Alright, in that case I'm not really sure why the cluster stats collector stopped working around the time of upgrade. When you performed the upgrade, did the elected master node change? The cluster stats collector only runs on the elected master node so perhaps this has something to do with it. However, there are other collectors that run only on the elected master node so I'm not sure why only the cluster stats collector would be impacted. :thinking:

Perhaps we could try to restart collection and see if that fixes the issue. Please try the following steps next:

  1. Stop all monitoring collection by running the following query against Elasticsearch:

    PUT _cluster/settings
    {
      "persistent": {
        "xpack.monitoring.collection.enabled": false
      }
    }
    
  2. Wait about 20 seconds. Re-run the query with the long output that you ran earlier. Verify that the timestamps in the output are at least 20 seconds old. This will confirm that collection has indeed stopped.

  3. Start up collection again:

    PUT _cluster/settings
    {
      "persistent": {
        "xpack.monitoring.collection.enabled": true
      }
    }
    
  4. Wait about 20 seconds. Re-run the query with the long output that you ran earlier. Verify that the timestamps in the output are current (or within the last 10 seconds). This will confirm that collection has indeed re-started. Especially verify that the timestamp nested inside the object with "key": "cluster_stats" is current.

  5. If all timestamps are current, visit the Kibana Monitoring UI and check if that's working again.

  6. If all timestamps are not current, check the Elasticsearch master node's logs for any errors and post them here.
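
At any point between these steps, the current value of the setting can be confirmed with the cluster settings API, for example:

GET _cluster/settings?filter_path=persistent.xpack.monitoring.collection.enabled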

Thanks.


(Jon) #16

OK, I did the disable and enable, waiting several minutes in between just to be sure, and now I see some ClusterStatsCollector log entries:

[2018-11-14T21:04:59,889][INFO ][o.e.c.s.ClusterSettings  ] [fast es35] updating [xpack.monitoring.collection.enabled] from [true] to [false]
[2018-11-14T21:08:07,401][INFO ][o.e.c.s.ClusterSettings  ] [fast es35] updating [xpack.monitoring.collection.enabled] from [false] to [true]
[2018-11-14T21:08:27,405][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [fast es35] collector [cluster_stats] timed out when collecting data
[2018-11-14T21:08:47,406][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [fast es35] collector [cluster_stats] timed out when collecting data
[2018-11-14T21:09:07,406][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [fast es35] collector [cluster_stats] timed out when collecting data

So it looks like a timeout is occurring; I just need to get to the bottom of it. I am now seeing this timeout steadily, every 20 seconds.

Thanks


(Jon) #17

So I ran this to increase the timeout:

PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.cluster.stats.timeout": "15s"
  }
}

and bingo, it's working. The timeout errors disappeared from the log and the monitoring interface came back. So the basic conclusion is that cluster stats collection is taking longer than the default 10 seconds to return. I assume this is due to the index/shard count on the cluster; is there anything I might be able to do or look at to speed it up?

Thanks
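
PS: in case anyone else hits similar timeouts from other collectors, they appear to have analogous timeout settings that follow the same naming pattern, e.g. (not something we needed on our cluster):

PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.index.stats.timeout": "15s",
    "xpack.monitoring.collection.node.stats.timeout": "15s"
  }
}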


(system) #18

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.