ES at 100% CPU, millions of CLUSTER_RECOVERED errors


(Scherma) #1

Hi, I've had ES running in my home lab for a while. It ran happily for months, and I left it alone for a bit to work on some other things; when I came back I found it pegging 100% CPU in the VM and not responding to anything other than basic GETs. I can't add or modify any data, and once it's running I can't stop the process via its systemd service; I have to kill -9 it. Here is an example of the errors that are occurring (only visible in DEBUG mode; nothing significant shows up at the default logging level):

[2016-09-26 17:58:34,455][DEBUG][gateway ] [myhost] [dom-ls-2016.09.14.21][0]: throttling allocation [[dom-ls-2016.09.14.21][0], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-09-26T16:58:17.887Z]]] to [[{myhost}{T_VDx6VqR-i0DtlkhYxXKQ}{10.0.19.4}{10.0.19.4:9300}]] on primary allocation

This is running on Debian Jessie; the VM has access to 2 cores and 4GB RAM, and it's just a single primary node, nothing else in the cluster. The errors occur regardless of whether LS and Kibana are running. I don't know what to look for next. This could be due to me snarling up the config somehow, but there are other possibilities too, e.g. my ESXi box doesn't have a UPS and I've had more than one power cut.

I've tried asking in the IRC channel on Freenode and unfortunately nobody there seemed to be able to help. Hopefully here I will find someone who knows how to help me fix it. Thanks in advance!


(Mark Walkom) #2

What version?


(Scherma) #3

My bad, 2.2.0.


(Mark Walkom) #4

What's in _cat/pending_tasks?


(Scherma) #5

32 of the following:

1440 4.8s URGENT shard-started ([dom-ls-2016.05.21.08][0], node[EX_53xOeRPKKu04ljCJAHw], [P], v[11], s[INITIALIZING], a[id=IGSzbbn-RLGwvOiTUWtpsw], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-01T11:11:30.741Z]]), reason [master {moose}{EX_53xOeRPKKu04ljCJAHw}{10.0.19.4}{10.0.19.4:9300} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]


(Mark Walkom) #6

Anything interesting in the other _cat endpoints, eg recovery, allocation?


(Scherma) #7

I'm too new to ES to know what qualifies as interesting. All entries in _cat/recovery say 'store done', so I'm guessing nothing there. Output of _cat/allocation is:

2255 9.4gb 26.3gb 121.2gb 147.5gb 17 10.0.19.4 10.0.19.4 moose
2409 UNASSIGNED


(Mark Walkom) #8

Ok, the problem is that you need to reduce your shard count, quite a lot.
You are wasting a lot of resources just maintaining that many shards.
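A rough sketch of the arithmetic, using the figures from the _cat/allocation output above (the one-replica default is my assumption; you only mentioned setting one primary shard per index):

```python
# Figures from _cat/allocation above: 2255 shards assigned on the one
# node, 2409 UNASSIGNED (largely replicas that can never be placed on a
# single-node cluster, plus primaries still waiting on recovery).
assigned = 2255
unassigned = 2409
total_shards = assigned + unassigned
print(total_shards)  # 4664 shards for a single 2-core / 4GB node

# Hourly indices with 1 primary shard each, plus the default 1 replica,
# add 48 shards per day:
shards_per_day = 24 * (1 + 1)
print(round(total_shards / shards_per_day))  # ~97 days of hourly indices
```

Every one of those shards has to be tracked in cluster state and recovered after a restart, which is why you're seeing millions of CLUSTER_RECOVERED throttling messages.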


(Scherma) #9

My Logstash pipeline was set to create hourly indices, with one shard per index. Warkolm has suggested that this is the cause of the problem and that the solution is to re-index to monthly. However, although I have stopped ES producing lots of Java heap errors (see here), the suggested reindex doesn't seem to be working - I should be getting dots as output and I'm getting nothing. I suspect this is an ES rather than a Logstash problem, because when I have tried individual curl requests, only the GETs have worked - PUT and DELETE fail.
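For reference, the kind of pipeline I'm attempting looks roughly like this (a sketch; the index names are illustrative, and the elasticsearch input/output plugins and the dots codec are stock Logstash pieces):

```conf
# Hypothetical sketch: read from the hourly indices, write to one monthly index.
input {
  elasticsearch {
    hosts   => ["10.0.19.4"]
    index   => "dom-ls-2016.09.*"   # all hourly indices for one month
    docinfo => true
  }
}
output {
  elasticsearch {
    hosts => ["10.0.19.4"]
    index => "dom-ls-2016.09"       # single monthly index
  }
  stdout { codec => dots }          # one dot per event - the output I'm not seeing
}
```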


(Mark Walkom) #10

What does your config look like?


(Scherma) #11

Assuming you mean /etc/elasticsearch/elasticsearch.yml:

cluster.name: elkmeup
node.name: moose
path.data: /opt/elasticsearch
path.logs: /opt/logs/elasticsearch
network.host: _site_
network.bind_host: ["site", "local"]

That's it, nothing fancy.


(Mark Walkom) #12

LS, is that where you think the issue currently is?


(Scherma) #13

No, there is definitely an issue with ES because I can't PUT/DELETE regardless of whether or not LS is running.


(Scherma) #14

Further troubleshooting: I have (eventually) become able to do POST/PUT/DELETE operations. I don't know for certain what made the difference, but I suspect ES simply needed enough time to process its recovery operations (I had been shutting it down between troubleshooting sessions so as not to stress my VM host too much).

I have now started reindexing to a more sensible timeframe (monthly instead of hourly), and I understand that shard count and index count have a significant impact on cluster performance and resource usage - but I would like some help understanding how and why this is the case, so that I can better judge how to manage index creation in future. Please could you enlighten me on this front?
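For what it's worth, the scale of the change is easy to quantify (simple arithmetic; the 6-month retention window is just an illustrative assumption):

```python
# Indices created over an illustrative 6-month window,
# hourly vs monthly rotation (1 primary shard per index either way).
hours_per_month = 24 * 30
months = 6

hourly_indices = hours_per_month * months  # 4320 indices
monthly_indices = months                   # 6 indices
print(hourly_indices // monthly_indices)   # 720x fewer indices (and shards)
```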


(system) #15