Hi, I've had ES running in my home lab for a while. It ran happily for months, and I left it alone for a bit to work on other things; when I came back to it I found it pegging 100% CPU in the VM and not responding to anything other than basic GETs. I can't add or modify any data, and once it's running I can't stop the process via its systemd service; I have to kill -9 it. Here is an example of the errors that are happening (only in DEBUG mode; nothing significant comes up at the default logging level):
[2016-09-26 17:58:34,455][DEBUG][gateway ] [myhost] [dom-ls-2016.09.14.21][0]: throttling allocation [[dom-ls-2016.09.14.21][0], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-09-26T16:58:17.887Z]]] to [[{myhost}{T_VDx6VqR-i0DtlkhYxXKQ}{10.0.19.4}{10.0.19.4:9300}]] on primary allocation
This is running on Debian Jessie; the VM has access to 2 cores and 4 GB RAM, and it's just a single primary node, nothing else in the cluster. The errors occur regardless of whether Logstash and Kibana are running. I don't know what to look for next. This could be due to me snarling up the config somehow, but there are other possibilities: for example, my ESXi box has no UPS and I've had more than one power cut.
I've tried asking in the IRC channel on Freenode, but unfortunately nobody there was able to help. Hopefully someone here knows how to fix this. Thanks in advance!
1440 4.8s URGENT shard-started ([dom-ls-2016.05.21.08][0], node[EX_53xOeRPKKu04ljCJAHw], [P], v[11], s[INITIALIZING], a[id=IGSzbbn-RLGwvOiTUWtpsw], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-01T11:11:30.741Z]]), reason [master {moose}{EX_53xOeRPKKu04ljCJAHw}{10.0.19.4}{10.0.19.4:9300} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]
I'm too new to ES to know what qualifies as interesting. Everything in _cat/recovery says 'store done', so I'm guessing there's nothing there. Output of _cat/allocation is:
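For reference, the diagnostics I've been running look like the commands below (assuming the node is listening on the default localhost:9200; adjust host/port for your setup):

```shell
# Disk use and shard counts per node
curl -s 'localhost:9200/_cat/allocation?v'

# Per-shard recovery state; 'store done' means the shard recovered from local disk
curl -s 'localhost:9200/_cat/recovery?v'

# Queued cluster-state tasks; a long queue here would explain an unresponsive master
curl -s 'localhost:9200/_cat/pending_tasks?v'
```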
My Logstash pipeline was set to create hourly indexes, with one shard per index. Warkolm has suggested that this is the cause of the problem and that the solution is to reindex to monthly indexes. However, although I have stopped ES producing lots of Java heap errors (see here), the suggested reindex doesn't seem to be working: I should be seeing dots output and I'm getting nothing. I suspect this is an ES rather than a Logstash problem, because when I've tried individual curl requests, only the GETs have worked; PUT and DELETE fail.
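For what it's worth, the consolidation I'm attempting is roughly equivalent to the _reindex API call sketched below (the _reindex API is only available from ES 2.3 onward, and the index names follow my own naming scheme, so treat them as placeholders rather than something to copy verbatim):

```shell
# Sketch: collapse all hourly indexes for September 2016 into one monthly index.
# dom-ls-2016.09.* matches the hourly pattern; dom-ls-2016.09 is the new target.
curl -s -XPOST 'localhost:9200/_reindex' -d '{
  "source": { "index": "dom-ls-2016.09.*" },
  "dest":   { "index": "dom-ls-2016.09" }
}'
```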
Further troubleshooting: I have (eventually) become able to do POST/PUT/DELETE operations. I don't know for certain what made the difference, but I suspect ES simply needed enough time to work through its recovery operations (I had been shutting it down between troubleshooting sessions so as not to stress my VM host too much).
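In case it helps anyone else: rather than guessing when recovery has finished, you can block until the cluster reaches at least yellow health (again assuming the default localhost:9200):

```shell
# Returns when the cluster reaches yellow status, or after 120s, whichever comes first.
# Yellow means all primary shards are allocated, so writes should work again.
curl -s 'localhost:9200/_cluster/health?wait_for_status=yellow&timeout=120s&pretty'
```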
I have now started reindexing to a more sensible time frame (monthly instead of hourly), and I understand that shard count and index count have a significant impact on cluster performance and resource usage. But I'd like some help understanding how and why that is the case, so that I can better judge how to manage index creation in future. Could you enlighten me on this front?
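To put rough numbers on my own situation (back-of-envelope shell arithmetic, assuming one index per period and one primary shard per index, no replicas): every shard is a complete Lucene index with its own heap-resident segment metadata and open file handles, so at this scale the shard count matters far more than the data volume.

```shell
#!/bin/sh
# Shards created per 30-day month under each naming scheme,
# assuming one index per period and one primary shard per index.
hourly=$(( 24 * 30 ))   # one index per hour: 720 shards/month
monthly=1               # one index per month: 1 shard/month
echo "hourly scheme:  $hourly shards/month"
echo "monthly scheme: $monthly shard/month"
```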