Marvel index creation fails and brings down the cluster


(Elvijs Sarkans) #1

We've got a 3-node (c4.large) Elasticsearch 2.1.0 cluster set up on AWS EC2 (I'll refer to the nodes as es-live-0, es-live-1 and es-live-2), which has been working nicely and feeding a webapp.

One of the nodes also hosts Kibana, which displays the data shipped by the marvel-agent.

Yesterday the marvel-agent failed to create a new Marvel data index (sample log below). Subsequently the cluster went down, but its status recovered to green after about 20 minutes. However, upon arriving in the office this morning I realised these were lies! Our webapp's requests were timing out and I couldn't ssh into es-live-0, even though it looked fine on EC2's monitoring dashboard. A restart fixed this, but seeing as this is our production system, I'd really like to get to the bottom of it.

Upon reading the thread "Marvel high index rate", I realised that we should move Kibana to a standalone node and send the Marvel data to an Elasticsearch instance running on it. Could this be the underlying problem? To give you an idea, we've got 3 main indices used by our webapp, spanning 23 shards. The total number of shards on the system is 145, most of them belonging to the Marvel data. At the same time, it feels like a high number of shards shouldn't render one of the nodes unresponsive, or am I wrong in assuming this?
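For anyone wanting to check the same split on their own cluster, here is a rough sketch of how the Marvel vs. non-Marvel shard counts could be tallied. It assumes you have saved the plain-text output of `GET _cat/shards?h=index,shard,prirep,state` to a string; the helper name `count_shards`, the `.marvel-` prefix default and the sample rows (including the `webapp-main` index name) are made up for illustration and are not taken from our cluster.

```python
from collections import Counter


def count_shards(cat_shards_text, marvel_prefix=".marvel-"):
    """Tally shard copies into 'marvel' and 'other' groups.

    Expects whitespace-separated rows whose first column is the index
    name, as in `_cat/shards?h=index,shard,prirep,state` output.
    """
    counts = Counter()
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        group = "marvel" if fields[0].startswith(marvel_prefix) else "other"
        counts[group] += 1
    return counts


# Hypothetical sample rows, only to show the expected input shape.
SAMPLE = """\
.marvel-es-2016.02.01 0 p STARTED
.marvel-es-2016.02.01 0 r STARTED
webapp-main 0 p STARTED
webapp-main 1 p STARTED
webapp-main 0 r STARTED
"""

if __name__ == "__main__":
    print(dict(count_shards(SAMPLE)))
```

Each row is one shard copy (primary or replica), so the totals count copies rather than distinct shard numbers.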

Also, if one of the nodes became unresponsive, why didn't the cluster eject it and continue as a 2-node setup?

Sample log:

Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  [2016-02-02 00:02:17,006][ERROR][marvel.agent             ] [Joshua Guthrie] background thread had an uncaught exception
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  ElasticsearchException[failed to flush exporter bulks]
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  	at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  	at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  	at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  	at java.lang.Thread.run(Thread.java:745)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  	Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution:
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  [0]: index [.marvel-es-2016.02.02], type [node_stats], id [null], message [RemoteTransportException[[Corruptor][es-live-0:9300][indices:admin/create]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (acquire index lock) within 1m];]];
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  		at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:106)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  		... 3 more
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  	Caused by: ElasticsearchException[failure in bulk execution:
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  [0]: index [.marvel-es-2016.02.02], type [node_stats], id [null], message [RemoteTransportException[[Corruptor][es-live-0:9300][indices:admin/create]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (acquire index lock) within 1m];]]
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  		at org.elasticsearch.marvel.agent.exporter.local.LocalBulk.flush(LocalBulk.java:114)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  		at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:101)
Feb 02 00:02:13 es-live-1 elasticsearch-live.log:  		... 3 more
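
The interesting part of the trace is the nested failure line: the index that could not be created, the node the create request was forwarded to, and the cluster event that timed out. A small sketch of pulling those fields out with a regular expression follows; the function name `parse_create_failure` and the pattern are my own and only tuned to the line format shown above, so treat it as an illustration rather than a general Elasticsearch log parser.

```python
import re

# Pattern tailored to the failure line in the log above (an assumption,
# not an official log format).
FAILURE_RE = re.compile(
    r"index \[(?P<index>[^\]]+)\], type \[(?P<type>[^\]]+)\].*?"
    r"RemoteTransportException\[\[(?P<node>[^\]]+)\]"
    r"\[(?P<address>[^\]]+)\]\[(?P<action>[^\]]+)\]\]"
    r".*?ProcessClusterEventTimeoutException\["
    r"failed to process cluster event \((?P<event>[^)]+)\) "
    r"within (?P<timeout>\w+)\]"
)


def parse_create_failure(line):
    """Return the failure details as a dict, or None if the line does not match."""
    match = FAILURE_RE.search(line)
    return match.groupdict() if match else None


# Copied from the log above, without the syslog prefix.
SAMPLE_LINE = (
    "[0]: index [.marvel-es-2016.02.02], type [node_stats], id [null], "
    "message [RemoteTransportException[[Corruptor][es-live-0:9300]"
    "[indices:admin/create]]; nested: ProcessClusterEventTimeoutException"
    "[failed to process cluster event (acquire index lock) within 1m];]];"
)

if __name__ == "__main__":
    print(parse_create_failure(SAMPLE_LINE))
```

On the sample line this picks out es-live-0:9300 as the node that received the indices:admin/create request and shows that the "acquire index lock" event timed out after 1m.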

(system) #2