The machine is a CentOS 7.4 instance running on Azure with 8 cores and 28 GB of RAM (21 GB allocated to ES). It only runs ES, as a data and master-eligible node, and is currently the master. My cluster has two other nodes with the same configuration, and neither of them shows any problem like this.
My logstash outputs are configured to round robin between the three nodes.
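For context, that round-robin behaviour comes from listing all three nodes in the Logstash `elasticsearch` output, which load-balances requests across the `hosts` array. A minimal sketch (the hostnames are placeholders, not from this thread):

```
output {
  elasticsearch {
    # Listing every node makes Logstash distribute bulk requests
    # across all of them instead of pinning to a single node.
    hosts => ["es-node1:9200", "es-node2:9200", "es-node3:9200"]
  }
}
```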
The ES version is 5.6.2, and the only change I made recently was upgrading my X-Pack license from Basic to Platinum.
What can cause this behaviour? There are no other error messages in the log file, and at the moment of the gap no queries were being made through Kibana or the API.
It looks like I could be wrong: it may not be an Azure problem as I was thinking. The patches for the Meltdown vulnerability were applied and I'm still having this problem.
Every time I have gaps like the images above, I see those timeout lines in the log file and a load spike in the monitoring, even though the only thing running on the machine is the Elasticsearch service.
[2018-01-17T12:40:43,242][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data
[2018-01-17T12:42:12,793][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data
[2018-01-17T12:42:50,135][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [server] collector [cluster_stats] timed out when collecting data
[2018-01-17T12:43:24,677][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data
[2018-01-17T12:44:14,885][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data
Then I see the following scenario on the monitoring page.
I have not resolved the question yet, but I found out that my problem seems to be related to the hardware where my Elastic nodes are running.
They run on Azure, and even on large machines (8 cores, 28 GB of RAM) with Premium disks backed by SSD drives, I'm seeing high I/O wait (around 10%), which causes the spike in system load and leaves the node unable to ingest data, producing the gaps.
I have a ticket open with Azure support, but no answer yet about the poor performance.
I would suggest that you check the I/O wait on your nodes.
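One quick way to check it without extra tooling (a sketch, assuming a Linux host; `iostat -x 1` from the sysstat package gives the same number with per-device detail):

```shell
#!/bin/sh
# Rough I/O-wait check: sample the aggregate "cpu" line of /proc/stat
# twice, one second apart, and print the iowait percentage over that
# interval. Field 6 of the cpu line is iowait jiffies; the sum of all
# numeric fields is total CPU time.
s1=$(head -n1 /proc/stat)
sleep 1
s2=$(head -n1 /proc/stat)
iowait=$(printf '%s\n%s\n' "$s1" "$s2" | awk '
  NR==1 { for (i = 2; i <= NF; i++) t1 += $i; w1 = $6 }
  NR==2 { for (i = 2; i <= NF; i++) t2 += $i; w2 = $6
          printf "%.1f", (w2 - w1) * 100 / (t2 - t1) }')
echo "iowait: ${iowait}%"
```

A sustained value around 10%, as in my case, means the CPUs are spending a large share of their time stalled on disk, which matches the load spikes and ingestion gaps.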
We had this error yesterday. Our suspicion is that Elasticsearch ran out of CPU power without being aware of it. It runs in a container on a powerful machine (32 cores); the container was restricted to 2 cores, but we had not yet set the `processors` setting (we planned to, and now we will actually do it). The customer ran a load test, and Elasticsearch responded with timeouts to most requests. A restart fixed the immediate issue.
We will set the `processors` parameter and ask the customer to rerun their test to confirm whether this was the issue. Until then it's only our best guess.
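For reference, in Elasticsearch 5.x that is a one-line change in `elasticsearch.yml`; the value shown assumes a container limited to 2 cores, as in our setup:

```yaml
# elasticsearch.yml
# Tell Elasticsearch it only has 2 usable cores, matching the container's
# CPU limit, so thread pools are sized for 2 processors instead of the
# 32 the host actually reports.
processors: 2
```

Without this, Elasticsearch sizes its thread pools from the host's core count, which can heavily oversubscribe a CPU-restricted container.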
My error was solved; it turned out to be a hardware issue in one of the storage servers.
It seems that if you are seeing those timeout errors, Elasticsearch is probably having trouble indexing data, which tends to be hardware bound, especially on disk I/O performance.