Cluster failure

I just had an incident where my entire cluster (all nodes) ended up at
100% CPU on every node at the same time and became completely
unresponsive, even to /_cluster/health. This happened while I was using
Kibana, which had been working fine up to that point. I was running a few
simple queries (nothing fancy, see log below). Even the shutdown call
failed. After restarting, the cluster seems fine again (green state after
a few minutes), but I'd like to prevent similar incidents in the future.
This is the second time I've seen Elasticsearch die in this way. Last
time, I got some helpful suggestions to put limits on the fielddata
cache, which I did, but obviously that wasn't the (only) problem.

I'm running a three-node cluster with Elasticsearch 1.1.1 and Java 1.7.0_55.

I'm using Logstash and Kibana. I have about three months' worth of
indices (5 shards, 2 replicas each) with a fresh index every day, 210 GB
in total. Kibana loads quite slowly but runs fine once loaded.
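
If my math is right, that's roughly 90 daily indices x 5 shards x 3
copies (1 primary + 2 replicas) = ~1350 shard copies, or around 450
shards per node, so there is already a fair amount of per-shard overhead
sitting on a 5 GB heap before any queries run.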

I don't see evidence of out-of-memory errors in the logs, but garbage
collections are very slow (5+ seconds) and the old generation stays
completely full, so obviously something is leaking memory over time.
Uptime of the cluster must have been around three weeks (since the last
time it died).
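
Next time, before things get to this point, I plan to check what is
actually holding on to the heap. Something like this should show the
main candidates (I'm assuming the node stats filtering below works on
1.1.x):

# per-node fielddata, filter cache and segment memory
curl -s 'localhost:9200/_nodes/stats/indices/fielddata,filter_cache,segments?pretty'

# heap usage and GC counters per node
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'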

Here's the log from just before one of the nodes died. You can see the
query I ran actually failed for a valid reason: the (small) index I was
trying to query from Kibana does not have the @timestamp field.

[2014-05-16 14:28:54,880][INFO ][monitor.jvm ] [192.168.1.13]
[gc][old][1562124][155961] duration [6.6s], collections [1]/[12.3s], total
[6.6s]/[9.3h], memory [4.9gb]->[4.9gb]/[4.9gb], all_pools {[young]
[532.5mb]->[532.5mb]/[532.5mb]}{[survivor]
[43.3mb]->[38.2mb]/[66.5mb]}{[old] [4.3gb]->[4.3gb]/[4.3gb]}
[2014-05-16 14:28:54,880][DEBUG][action.search.type ] [192.168.1.13]
[deals_v5][4], node[ObP5F1ZSQKWOJrAb6T1T0g], [P], s[STARTED]: Failed to
execute [org.elasticsearch.action.search.SearchRequest@4e2d22c8] lastShard
[true]
org.elasticsearch.search.SearchParseException: [deals_v5][4]:
from[-1],size[-1]: Parse Failure [Failed to parse source
[{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"30s"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"type:deal"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1400239478674,"to":"now"}}}]}}}}}}}},"size":0}]]
at
org.elasticsearch.search.SearchService.parseSource(SearchService.java:634)
at
org.elasticsearch.search.SearchService.createContext(SearchService.java:507)
at
org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:480)
at
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:252)
at
org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:202)
at
org.elasticsearch.action.search.type.TransportSearchCountAction$AsyncAction.sendExecuteFirstPhase(TransportSearchCountAction.java:70)
at
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
at
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$4.run(TransportSearchTypeAction.java:296)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.search.facet.FacetPhaseExecutionException:
Facet [0]: (key) field [@timestamp] not found
at
org.elasticsearch.search.facet.datehistogram.DateHistogramFacetParser.parse(DateHistogramFacetParser.java:160)
at
org.elasticsearch.search.facet.FacetParseElement.parse(FacetParseElement.java:93)
at
org.elasticsearch.search.SearchService.parseSource(SearchService.java:622)
... 10 more
[2014-05-16 14:28:54,935][DEBUG][action.search.type ] [192.168.1.13]
[deals_v5][0], node[ObP5F1ZSQKWOJrAb6T1T0g], [P], s[STARTED]: Failed to
execute [org.elasticsearch.action.search.SearchRequest@4622b34d] lastShard
[true]
org.elasticsearch.search.SearchParseException: [deals_v5][0]:
from[-1],size[-1]: Parse Failure [Failed to parse source
[{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"30s"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"type:deal"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1400239468981,"to":"now"}}}]}}}}}}}},"size":0}]]
at
org.elasticsearch.search.SearchService.parseSource(SearchService.java:634)
at
org.elasticsearch.search.SearchService.createContext(SearchService.java:507)
at
org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:480)
at
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:252)
at
org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:202)
at
org.elasticsearch.action.search.type.TransportSearchCountAction$AsyncAction.sendExecuteFirstPhase(TransportSearchCountAction.java:70)
at
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
at
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$4.run(TransportSearchTypeAction.java:296)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.search.facet.FacetPhaseExecutionException:
Facet [0]: (key) field [@timestamp] not found
at
org.elasticsearch.search.facet.datehistogram.DateHistogramFacetParser.parse(DateHistogramFacetParser.java:160)
at
org.elasticsearch.search.facet.FacetParseElement.parse(FacetParseElement.java:93)
at
org.elasticsearch.search.SearchService.parseSource(SearchService.java:622)
... 10 more
[2014-05-16 14:28:54,947][DEBUG][action.search.type ] [192.168.1.13]
[deals_v5][3], node[_MHmRPPoR-SzgaxT4FL8wQ], [P], s[STARTED]: Failed to
execute [org.elasticsearch.action.search.SearchRequest@4622b34d] lastShard
[true]
org.elasticsearch.transport.NodeDisconnectedException:
[192.168.1.14][inet[/192.168.1.14:9300]][search/phase/query] disconnected
[2014-05-16 14:29:00,036][INFO ][monitor.jvm ] [192.168.1.13]
[gc][old][1562125][155962] duration [5s], collections [1]/[5.1s], total
[5s]/[9.3h], memory [4.9gb]->[4.9gb]/[4.9gb], all_pools {[young]
[532.5mb]->[532.5mb]/[532.5mb]}{[survivor]
[38.2mb]->[63.9mb]/[66.5mb]}{[old] [4.3gb]->[4.3gb]/[4.3gb]}
[2014-05-16 14:29:06,531][DEBUG][action.search.type ] [192.168.1.13]
[deals_v5][2], node[_MHmRPPoR-SzgaxT4FL8wQ], [R], s[STARTED]: Failed to
execute [org.elasticsearch.action.search.SearchRequest@4e2d22c8] lastShard
[true]
org.elasticsearch.transport.NodeDisconnectedException:
[192.168.1.14][inet[/192.168.1.14:9300]][search/phase/query] disconnected
[2014-05-16 14:30:30,836][DEBUG][action.search.type ] [192.168.1.13]
[deals_v5][0], node[_MHmRPPoR-SzgaxT4FL8wQ], [R], s[STARTED]: Failed to
execute [org.elasticsearch.action.search.SearchRequest@4e2d22c8] lastShard
[true]
org.elasticsearch.transport.NodeDisconnectedException:
[192.168.1.14][inet[/192.168.1.14:9300]][search/phase/query] disconnected

After this last message, nothing happened until I restarted the cluster.
Nothing else got logged. The shutdown call failed and the process was
killed the hard way by the init.d script.
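
If this happens again, I'll try to grab a heap dump before killing the
process, so I can see what is filling the old generation. Something like
this should work (assuming the JDK tools are installed and that the main
class name below matches what 1.1.x runs):

# dump the live heap of the Elasticsearch JVM for offline analysis
jmap -dump:live,format=b,file=/tmp/es-heap.hprof \
  $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)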

I'm using ES_HEAP_SIZE=5G and it doesn't seem to be enough. Rather than
blindly increasing the heap size, I would like to know how I can prevent
it from running out of memory in the first place. I've already
restricted the fielddata cache, but obviously something else is growing
unbounded. Beyond limiting the fielddata cache, I've done a minimal
amount of tuning. I'd like some advice on how to configure things so
that Kibana/Logstash can't kill my cluster. I'd prefer it to be merely
slow rather than completely dead: slow I can detect and fix, but dead
and unresponsive is a disaster. This cluster mostly does nothing but
index the steady stream of Logstash data; I occasionally look at a few
Kibana dashboards a couple of times per day.

Here's my config:

cluster.name: linko_elasticsearch

node.name: 192.168.1.14

# force the cluster transport to use the correct network
transport.host: eth1:ipv4

discovery.zen.ping.unicast.hosts: 192.168.1.13,192.168.1.14,192.168.1.15

path.data: /opt/elasticsearch-data

# prevent swapping
bootstrap.mlockall: true

indices.fielddata.cache.size: 30%
indices.fielddata.cache.expire: 1h
index.cache.filter.expire: 1h

What other settings should I be configuring here to minimize the risk of
Elasticsearch ever running out of memory?
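
One thing I'm considering is a tighter fielddata circuit breaker (I'm
not sure the setting name below is exactly right for 1.1.x, so please
correct me if not), either in elasticsearch.yml or applied dynamically:

# elasticsearch.yml: trip the breaker before fielddata can fill the heap
indices.fielddata.breaker.limit: 40%

# or set it dynamically without a restart
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": { "indices.fielddata.breaker.limit": "40%" }
}'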
