I'm trying to understand a strange behaviour that started recently.
I'm getting messages like the following:
[2018-01-03T17:26:35,615][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-name] collector [cluster_stats] timed out when collecting data
And sometimes, at the same time, I have a gap in my data.
The load on the machine starts to rise right before the gap and peaks right after it, but after that it stays fine.
The machine is a CentOS 7.4 instance running on Azure with 8 cores and 28 GB of RAM (21 GB given to ES). It only runs ES as a data and master-eligible node, and it is currently the master node. My cluster has two other nodes with the same configuration, and neither of them is showing any problem like that.
My logstash outputs are configured to round robin between the three nodes.
The ES version is 5.6.2 and the only change I made recently was upgrading my X-Pack license from Basic to Platinum.
What can cause this behaviour? I do not have any other error messages in the log file, and at the moment of the gap there were no queries being made through Kibana or the API.
Sorry, I think this post can be deleted. It looks like it is an Azure problem.
It looks like I could be wrong and this may not be an Azure problem as I was thinking: the patches for the Meltdown vulnerability were applied and I'm still having this problem.
Every time I have gaps like the ones in the images above, I have those timeout lines in the log file and a load spike in the monitoring, but the only thing running on the machine is the Elasticsearch service.
[2018-01-17T12:40:43,242][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data
[2018-01-17T12:42:12,793][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data
[2018-01-17T12:42:50,135][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [server] collector [cluster_stats] timed out when collecting data
[2018-01-17T12:43:24,677][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data
[2018-01-17T12:44:14,885][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data
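To line these errors up against the gaps, it can help to pull the timestamps and collector names out of the log. The following is only a rough sketch: the regular expression is written against the lines pasted above, and the names `TIMEOUT_RE`, `parse_timeouts` and `SAMPLE` are made up for this example, not part of Elasticsearch or X-Pack.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines shaped like the ones pasted above, e.g.
# [2018-01-17T12:40:43,242][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data
TIMEOUT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[ERROR\s*\]\[[^\]]+\]\s*\[(?P<node>[^\]]+)\]\s*"
    r"collector \[(?P<collector>[^\]]+)\] timed out when collecting data"
)

def parse_timeouts(lines):
    """Return a list of (timestamp, node, collector) for each timeout line."""
    events = []
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if m:
            # The log uses a comma before the milliseconds.
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
            events.append((ts, m.group("node"), m.group("collector")))
    return events

# The five lines from the post, used here as sample input.
SAMPLE = [
    "[2018-01-17T12:40:43,242][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data",
    "[2018-01-17T12:42:12,793][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data",
    "[2018-01-17T12:42:50,135][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [server] collector [cluster_stats] timed out when collecting data",
    "[2018-01-17T12:43:24,677][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data",
    "[2018-01-17T12:44:14,885][ERROR][o.e.x.m.c.i.IndexStatsCollector] [server] collector [index-stats] timed out when collecting data",
]

events = parse_timeouts(SAMPLE)
per_collector = Counter(collector for _, _, collector in events)
```

With the timestamps in hand, the first and last timeout of each burst can be compared with the start and end of the gap shown in the monitoring page.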
Then I have the following scenario on the monitoring page.
My index data will have gaps:
It should normally look like this:
The only service running on this node is Elasticsearch.
I'm getting the same error. Have you resolved the problem?
Hello @len_carl and @pruta,
I have not resolved the problem, but I found out that it seems to be related to the hardware my Elasticsearch nodes are running on.
They are running on Azure, and even with big machines (8 cores, 28 GB of RAM) and premium disks backed by SSD drives, I'm seeing a high I/O wait (around 10%). This causes the spike in the system load and makes the node unable to ingest data, which causes the gaps.
I have a ticket open with Azure support, but no answer yet about the poor performance.
I would suggest that you check the I/O wait of the system on your nodes.
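If you do not have a monitoring tool at hand, one way to check this on Linux is to sample the aggregate `cpu` line of `/proc/stat` twice and look at the share of time spent in iowait. This is just a minimal sketch under that assumption; the function names are made up for this example, and tools such as `iostat` or `top` report the same figure.

```python
import time

def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(v) for v in parts[1:]]

def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two 'cpu' lines."""
    # Fields: user nice system idle iowait irq softirq steal (guest fields ignored).
    b = parse_cpu_line(before)[:8]
    a = parse_cpu_line(after)[:8]
    deltas = [y - x for x, y in zip(b, a)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait

def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()

def sample_iowait(interval=5.0):
    """Measure iowait over `interval` seconds on the local machine (Linux only)."""
    before = read_cpu_line()
    time.sleep(interval)
    after = read_cpu_line()
    return iowait_percent(before, after)
```

Calling `sample_iowait()` on a node while a gap is forming should show whether the iowait share climbs at the same time as the load.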
We had this error yesterday. Our suspicion is that Elasticsearch was out of CPU power without being aware of it. It runs in containers on a powerful machine (32 cores). The container was restricted to 2 cores, but we hadn't yet set the processors setting; we planned to, and now we'll actually do it. The customer did a load test, and Elasticsearch came back with a timeout for most requests. A restart fixed the immediate issue.
We will set the processors parameter and ask the customer to rerun their test to confirm whether this was the issue. Until then it's only our best guess.
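For anyone in the same situation, a quick way to see the mismatch is to compare the CPU count the host reports with the limit the container actually has. The sketch below assumes a cgroup v1 layout under `/sys/fs/cgroup/cpu` (the path and the helper names are assumptions for this example; cgroup v2 exposes the limit differently), and it only reports numbers, it does not change any Elasticsearch setting.

```python
import math
import os

def cpus_from_quota(quota_us, period_us):
    """Convert a CFS quota/period pair into a whole number of usable CPUs.

    Returns None when there is no limit (a quota of -1 means unlimited).
    """
    if quota_us is None or quota_us <= 0 or period_us <= 0:
        return None
    return max(1, math.ceil(quota_us / period_us))

def read_cgroup_v1_limit(base="/sys/fs/cgroup/cpu"):
    """Read the CPU limit from cgroup v1 files; None if missing or unlimited."""
    try:
        with open(os.path.join(base, "cpu.cfs_quota_us")) as f:
            quota = int(f.read().strip())
        with open(os.path.join(base, "cpu.cfs_period_us")) as f:
            period = int(f.read().strip())
    except (OSError, ValueError):
        return None
    return cpus_from_quota(quota, period)

def describe_cpu_mismatch():
    """Return (visible_cpus, container_limit) for the current process."""
    return os.cpu_count(), read_cgroup_v1_limit()
```

In a case like ours, `describe_cpu_mismatch()` run inside the container would be expected to return something like 32 visible CPUs against a limit of 2, and the limit is the value we would use for the processors setting.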
Hello, just to update this topic,
My error was solved. It was a hardware issue in one of the storage servers.
It seems that if you are getting those timeout errors, Elasticsearch is probably having trouble indexing data, and that seems to be hardware bound, especially by disk I/O performance.
How did you discover which server was the problem?