Marvel.agent.exporter create failure kills cluster

Hi all,
I have a huge problem: when Marvel can't create an index, it actually kills the cluster.
I am using version 1.4.5, and over the last month we have had a strange issue where the cluster dies almost every week. Each time, the logs contain:

[2016-04-13 12:58:48,230][ERROR][marvel.agent.exporter ] [Node1] create failure (index:[.marvel-2016.04.13] type: [index_stats]): RemoteTransportException[[Node4][inet[/IP:9300]][indices:data/write/bulk[s]]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 50) on org.elasticsea$

Hi Volodymyr,

It looks like your cluster is overloaded. Those exceptions are saying that your bulk indexing queues are full and the cluster cannot process the indexing requests from Marvel. Given that Marvel has a consistent indexing rate (every 10 seconds by default), I would imagine that there are other things going on here that are causing the issues.

Do you perform bulk indexing actions in other processes?

Can you confirm which version of ES, and which version of Marvel you are running?

Thanks,
Steve

Hi,

So you're running ES 1.4.5, but which version of Marvel?

The errors that you're seeing relate to bulk ingestion being rejected -- it means that the bulk queue in your nodes is backed up when Marvel tries to send more stats (index_stats in this case). Specifically, it looks like Node4 is the one backed up.

This means that that node is too busy to handle any additional bulk ingestion. Take a look at that node to see why it's backing up.
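
One way to check this is to look at the bulk thread pool on each node through the `_cat/thread_pool` API. Below is a rough Python sketch, not a definitive tool. It assumes the cluster is reachable over HTTP at `http://localhost:9200` (adjust for your setup) and that the `host`, `bulk.active`, `bulk.queue` and `bulk.rejected` columns are available, as they are in ES 1.x. The function names are made up for this example. The default capacity of 50 is taken from the "queue capacity 50" in the log above.

```python
import urllib.request


def fetch_bulk_stats(base_url="http://localhost:9200"):
    """Fetch per-node bulk thread pool stats as plain text from the _cat API.

    base_url is an assumption; point it at any node of the cluster.
    """
    url = base_url + "/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def parse_bulk_stats(text):
    """Turn the whitespace-separated _cat table (with header row) into dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        row = dict(zip(header, ln.split()))
        # The three bulk columns are counters; convert them to integers.
        for key in ("bulk.active", "bulk.queue", "bulk.rejected"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


def backed_up_nodes(rows, queue_capacity=50):
    """Return hosts whose bulk queue is full or that have rejected bulk requests."""
    return [
        r["host"]
        for r in rows
        if r["bulk.queue"] >= queue_capacity or r["bulk.rejected"] > 0
    ]
```

Calling `backed_up_nodes(parse_bulk_stats(fetch_bulk_stats()))` returns the hosts worth a closer look. Note that `bulk.rejected` is cumulative since the node started, so a non-zero value means the node has rejected bulk requests at some point, not necessarily right now.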

Hope that helps,
Chris

I don't understand how Marvel versions work, so I suppose it's 1.3, because I can't find any version.

As I see it, this is an issue with ES, so my next question is: is there any way to skip a bulk request when there is too much work? For me the problem is that the cluster dies, and to bring it back I need to restart it.

Hi Volodymyr,

Given that this looks like a production cluster, I recommend that you set up a separate monitoring cluster and have Marvel send its data there, so your monitoring data is on a different cluster than the cluster being monitored. This should reduce the load on your production cluster, and will make it easier to troubleshoot any issues with your production cluster.

To do this, you need to configure the Marvel Exporter to point at a different cluster. If you don't configure it (the default), it will store monitoring data in the local cluster.
https://www.elastic.co/guide/en/marvel/marvel-1.3/stats-export.html#stats-export
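
As a rough illustration, the change in `elasticsearch.yml` on each production node might look like the fragment below. This is a sketch that assumes the Marvel 1.x setting name `marvel.agent.exporter.es.hosts` described on the page linked above. The hostname `monitoring-node1` is a placeholder and does not come from this thread.

```yaml
# elasticsearch.yml on each production node
# "monitoring-node1" is a hypothetical host in the separate monitoring cluster;
# replace it with the HTTP address (host:port) of one or more of its nodes.
marvel.agent.exporter.es.hosts: ["monitoring-node1:9200"]
```

Check the setting name against the documentation for your installed Marvel version before applying it.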

Thanks,
Steve

Thank you, I will start by doing that. I also have some logging going to the production cluster, so I will switch that to preproduction.