ES at 100% CPU, millions of CLUSTER_RECOVERED errors


(Scherma) #1

Hi, I've had ES running in my home lab for a while. It ran happily for months, and I left it alone for a bit to work on some other things; when I came back I found it pegging 100% CPU in the VM and not responding to anything other than basic GETs. I can't add or modify any data, and once it's running I can't stop the process via its systemd service; I have to kill -9 it. Here is an example of the errors that are occurring (only visible in DEBUG mode; nothing significant shows up at the default logging level):

[2016-09-26 17:58:34,455][DEBUG][gateway ] [myhost] [dom-ls-2016.09.14.21][0]: throttling allocation [[dom-ls-2016.09.14.21][0], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-09-26T16:58:17.887Z]]] to [[{myhost}{T_VDx6VqR-i0DtlkhYxXKQ}{10.0.19.4}{10.0.19.4:9300}]] on primary allocation

This is running on Debian Jessie; the VM has access to 2 cores and 4GB RAM, and it's just a single primary node, nothing else in the cluster. The errors occur regardless of whether LS and Kibana are running. I don't know what to look for next. This could be due to me snarling up the config somehow, but there are other possibilities too, e.g. my ESXi box doesn't have a UPS and I've had more than one power cut.

I've tried asking in the IRC channel on Freenode and unfortunately nobody there seemed to be able to help. Hopefully here I will find someone who knows how to help me fix it. Thanks in advance!


(Mark Walkom) #2

What version?


(Scherma) #3

My bad, 2.2.0.


(Mark Walkom) #4

What's in _cat/pending_tasks?


(Scherma) #5

32 of the following:

1440 4.8s URGENT shard-started ([dom-ls-2016.05.21.08][0], node[EX_53xOeRPKKu04ljCJAHw], [P], v[11], s[INITIALIZING], a[id=IGSzbbn-RLGwvOiTUWtpsw], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-01T11:11:30.741Z]]), reason [master {moose}{EX_53xOeRPKKu04ljCJAHw}{10.0.19.4}{10.0.19.4:9300} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]


(Mark Walkom) #6

Anything interesting in the other _cat endpoints, eg recovery, allocation?


(Scherma) #7

I'm too new to ES to know what qualifies as interesting. All entries in _cat/recovery say 'store done', so I'm guessing nothing there. Output of _cat/allocation is:

2255 9.4gb 26.3gb 121.2gb 147.5gb 17 10.0.19.4 10.0.19.4 moose
2409 UNASSIGNED


(Mark Walkom) #8

Ok, the problem is that you need to reduce your shard count, quite a lot.
You are wasting a lot of resources just maintaining that many shards.
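A rough sketch of the arithmetic, using the figures from the _cat/allocation output above (the one-replica default is my assumption; you only mentioned setting one primary shard per index):

```python
# Figures from _cat/allocation above: 2255 shards assigned on the one
# node, 2409 UNASSIGNED (largely replicas that can never be placed on a
# single-node cluster, plus primaries still waiting on recovery).
assigned = 2255
unassigned = 2409
total_shards = assigned + unassigned
print(total_shards)  # 4664 shards for a single 2-core / 4GB node

# Hourly indices with 1 primary shard each, plus the default 1 replica,
# add 48 shards per day:
shards_per_day = 24 * (1 + 1)
print(round(total_shards / shards_per_day))  # ~97 days of hourly indices
```

Every one of those shards has to be tracked in cluster state and recovered after a restart, which is why you're seeing millions of CLUSTER_RECOVERED throttling messages.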


(Scherma) #9

My Logstash pipeline was set to create hourly indices, with one shard per index. Warkolm has suggested that this is the cause of the problem and that the solution is to re-index to monthly. However, although I have stopped ES producing lots of Java heap errors (see here), the suggested reindex doesn't seem to be working - I should be getting dots as output and I'm getting nothing. I suspect this is an ES rather than a Logstash problem, because when I have tried individual curl requests, only the GETs have worked - PUT and DELETE fail.
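For reference, the kind of pipeline I'm attempting looks roughly like this (a sketch; the index names are illustrative, and the elasticsearch input/output plugins and the dots codec are stock Logstash pieces):

```conf
# Hypothetical sketch: read from the hourly indices, write to one monthly index.
input {
  elasticsearch {
    hosts   => ["10.0.19.4"]
    index   => "dom-ls-2016.09.*"   # all hourly indices for one month
    docinfo => true
  }
}
output {
  elasticsearch {
    hosts => ["10.0.19.4"]
    index => "dom-ls-2016.09"       # single monthly index
  }
  stdout { codec => dots }          # one dot per event - the output I'm not seeing
}
```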


(Mark Walkom) #10

What does your config look like?


(Scherma) #11

Assuming you mean /etc/elasticsearch/elasticsearch.yml:

cluster.name: elkmeup
node.name: moose
path.data: /opt/elasticsearch
path.logs: /opt/logs/elasticsearch
network.host: _site_
network.bind_host: ["site", "local"]

That's it, nothing fancy.


(Mark Walkom) #12

LS, is that where you think the issue currently is?


(Scherma) #13

No, there is definitely an issue with ES because I can't PUT/DELETE regardless of whether or not LS is running.


(Scherma) #14

Further troubleshooting: I have (eventually) become able to do POST/PUT/DELETE operations. I don't know for certain what made the difference, but I suspect ES simply needed enough time to process its recovery operations (I had been shutting it down between troubleshooting sessions so as not to stress my VM host too much).

I have now started reindexing to a more sensible timeframe (monthly instead of hourly), and I understand that shard count and index count have a significant impact on cluster performance and resource usage - but I would like some help understanding how and why this is the case, so that I can better judge how to manage index creation in future. Please could you enlighten me on this front?
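For what it's worth, the scale of the change is easy to quantify (simple arithmetic; the 6-month retention window is just an illustrative assumption):

```python
# Indices created over an illustrative 6-month window,
# hourly vs monthly rotation (1 primary shard per index either way).
hours_per_month = 24 * 30
months = 6

hourly_indices = hours_per_month * months  # 4320 indices
monthly_indices = months                   # 6 indices
print(hourly_indices // monthly_indices)   # 720x fewer indices (and shards)
```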


(system) #15