We are currently running ELK 6.4.2 with OpenJDK 1.8.0-191 on CentOS 7.5. This cluster currently contains 30 billion docs, 9820 shards, 919 indices, and 90TB of data.
We were previously running 6.2.4, and one day while we were upgrading memory on the ES nodes something went sideways with the X-Pack monitoring. The monitoring indices are still being written as expected, but the interface complains that it cannot find the cluster. After some troubleshooting we were unable to get it working again, but we were already planning to upgrade to 6.4.x, so we figured we would wait until then to really dig deeper. Fast forward: we have completed upgrading all our clusters to 6.4.2, and the monitoring page on this particular cluster is still broken.
The error we see is:
"Monitoring Request Failed
Unable to find the cluster in the selected time range. UUID: pKuY7ygvSGGF8iAR-rrVQA
HTTP 404"
I have looked through the data in the monitoring-* indices, and every document has a cluster_uuid field containing "pKuY7ygvSGGF8iAR-rrVQA".
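In case it helps anyone reproduce the check, here is a minimal sketch of a search body that counts recent monitoring documents per cluster_uuid, broken down by document type. The `timestamp` and `type` field names and the default one-hour window are assumptions about the monitoring documents, and the function name is just illustrative; the body would be sent to the `_search` endpoint of whatever index pattern matches your monitoring indices.

```python
import json


def build_uuid_check_query(window="now-1h"):
    """Build a search body that counts monitoring docs per cluster_uuid and type.

    Assumes the documents carry `timestamp` and `type` fields in addition to
    the `cluster_uuid` field mentioned above.
    """
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"range": {"timestamp": {"gte": window}}},
        "aggs": {
            "by_uuid": {
                "terms": {"field": "cluster_uuid", "size": 10},
                "aggs": {
                    "by_type": {"terms": {"field": "type", "size": 20}}
                },
            }
        },
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into Kibana Dev Tools or curl.
    print(json.dumps(build_uuid_check_query(), indent=2))
```

If the UUID bucket comes back but one of the expected document types is missing from it within the window, that would be one place to look for the mismatch.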
The data in the monitoring-* indices is right up to date, and I can browse it in Kibana just fine. It seems there is some kind of mismatch between the data and what the monitoring page is expecting to find. The error comes up almost immediately; there is no delay as if something were timing out. The monitoring indices contain 3 shards plus one replica. Increasing the time range past 1 hour still does not return any results.
Any help or tips on what to look for would be appreciated.