Mark,
We have 18 data nodes across 9 physical servers, each with 128 GB of RAM and
40 cores. All nodes are currently data nodes, and all are master-eligible. We
are getting ready to change this, however, and put in 3 dedicated master
nodes.
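For reference, a minimal elasticsearch.yml sketch of a dedicated master in 1.x
(the quorum value below assumes exactly 3 master-eligible nodes):

  node.master: true
  node.data: false
  # quorum of master-eligible nodes: (3 / 2) + 1 = 2
  discovery.zen.minimum_master_nodes: 2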
-Kris
On Thursday, December 25, 2014 5:01:44 PM UTC-7, Mark Walkom wrote:
3 million requests a second!
Can you provide some details on your cluster, i.e. node types and counts?
On 26 December 2014 at 05:14, vineeth mohan <vm.vine...@gmail.com> wrote:
Hi,
That could be a good reason.
But then it won't happen unless you have changed the thread pool settings for
indexing.
If there is more load than a node can process, requests go to its queue. And
once the queue (by default 200 entries per node for the index pool) fills up,
requests get rejected.
Can you check whether there are any rejected index requests in the process? But again, that won't matter if you haven't changed the thread pool settings.
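For readers following along, a quick way to spot rejections is the cat thread
pool API (localhost is a placeholder for any node in the cluster):

  curl -s 'http://localhost:9200/_cat/thread_pool?v'

Non-zero values in the rejected columns mean the index (or bulk/search) queues
have been filling up.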
Thanks
Vineeth
On Thu, Dec 25, 2014 at 11:39 PM, Abhishek <abhis...@gmail.com> wrote:
Thanks Vineeth. I was thinking about that, but these merge errors were
also happening before the outage, and the Elasticsearch process never died.
I was also wondering: at the time of the outage we had about 3 million
requests per second. Could the sheer number of requests have made the network
layer go crazy? Because everything recovered in 7 minutes.
Sent from my iPhone
On Dec 25, 2014, at 11:12 PM, vineeth mohan <vm.vine...@gmail.com> wrote:
Hello Abhishek,
Can you try to correlate shard merge operations with the time of these
cascading failures?
I feel there is a correlation between the two.
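For instance, the node stats API exposes per-node merge counts and total merge
time (localhost is a placeholder):

  curl -s 'http://localhost:9200/_nodes/stats/indices?pretty' | grep -A 8 '"merges"'

If current merges and merge time spike just before the node_left events in
Marvel, the two are probably connected.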
If so, we can do some optimization on that side.
Thanks,
Vineeth
On Thu, Dec 25, 2014 at 8:53 AM, Abhishek Andhavarapu <abhis...@gmail.com> wrote:
Mark,
Thanks for reading. Our heap sizes are kept under 32 gigs so the JVM can still
use compressed object pointers. We ideally double our cluster every year, so
the number of shards is planned for future growth, and for the way documents
are spread across all the nodes in the cluster.
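As a sanity check, the JVM will tell you whether compressed oops are actually
in use at a given heap size; a sketch, using the 31 GB figure discussed in
this thread:

  java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops

If UseCompressedOops prints as false at your chosen -Xmx, the heap is too
large for compressed pointers.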
Thanks,
Abhishek
On Thursday, December 25, 2014 2:05:22 AM UTC+5:30, Mark Walkom wrote:
That's a pretty big number of shards; why is it so high?
The recommendation there is one shard per node per index, so you should
(ideally) have closer to 6,600 shards.
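One way to see the per-node spread is the cat allocation API (localhost is a
placeholder):

  curl -s 'http://localhost:9200/_cat/allocation?v'

The shards column shows how many shards each node carries; roughly 10,000
shards over 18 data nodes is about 550 per node.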
On 25 December 2014 at 07:07, Pat Wright <sqla...@gmail.com> wrote:
Mark,
I work on the cluster as well, so I can answer the size/makeup questions.
Data: 580GB
Shards: 10K
Indices: 347
ES version: 1.3.2
Not sure about the Java version.
Thanks for getting back!
pat
On Wednesday, December 24, 2014 12:04:03 PM UTC-7, Mark Walkom wrote:
You should drop your heap to 31GB; over that you lose some performance and
some actual heap space due to uncompressed pointers.
It looks like a node, or nodes, dropped out due to GC. How much data and how
many indexes do you have? What ES and Java versions?
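In 1.x the heap is normally set via the ES_HEAP_SIZE environment variable read
by the startup scripts; a sketch (the file path depends on your distro's
packaging):

  # /etc/default/elasticsearch or /etc/sysconfig/elasticsearch
  ES_HEAP_SIZE=31g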
On 24 December 2014 at 22:29, Abhishek <abhis...@gmail.com> wrote:
Thanks for reading, Vineeth. That was my initial thought, but I couldn't find
any old-gen GC during the outage. Each ES node has 32 gigs. Each box has 128
gigs split between 2 ES nodes (32 GB each) and the file system cache (64 GB).
On Wed, Dec 24, 2014 at 4:49 PM, vineeth mohan <vm.vine...@gmail.com> wrote:
Hi,
What is the memory on each of these machines?
Also see if there is any correlation between garbage collection and the times
this anomaly happens. Chances are that a stop-the-world pause blocks the
discovery pings for a while, and the cluster decides some nodes are gone.
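Two quick things to check on that theory (hosts and timeout values here are
placeholders, not your config): per-node GC stats, and the zen fault-detection
timeouts.

  curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep -A 6 '"gc"'

  # elasticsearch.yml -- give long GC pauses more slack before a node is declared dead
  discovery.zen.fd.ping_timeout: 60s
  discovery.zen.fd.ping_retries: 5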
Thanks,
Vineeth
On Wed, Dec 24, 2014 at 4:23 PM, Abhishek Andhavarapu <abhis...@gmail.com> wrote:
Hi all,
We recently had a cascading cluster failure. From 16:35 to 16:42 the cluster
went red and then recovered itself. I can't seem to find any obvious logs
around this time.
The cluster has about 19 nodes: 9 physical boxes running two instances of
Elasticsearch each, and one VM as a balancer for indexing. The CPU is normal
and memory usage is below 75%.
(Screenshot: heap during the outage.)
(Screenshot: heap once stable.)
Below is the list of events that happened, according to Marvel:

2014-12-23T16:41:22.456-07:00  node_event  node_joined        [E0009-1][XX] joined
2014-12-23T16:41:19.439-07:00  node_event  node_left          [E0009-1][XX] left
2014-12-23T16:41:19.439-07:00  node_event  elected_as_master  [E0011-0][XX] became master
2014-12-23T16:41:04.392-07:00  node_event  node_joined        [E0007-0][XX] joined
2014-12-23T16:40:49.176-07:00  node_event  node_joined        [E0007-1][XX] joined
2014-12-23T16:40:07.781-07:00  node_event  node_left          [E0007-1][XX] left
2014-12-23T16:40:07.781-07:00  node_event  elected_as_master  [E0010-0][XX] became master
2014-12-23T16:39:51.802-07:00  node_event  node_left          [E0011-1][XX] left
2014-12-23T16:39:05.897-07:00  node_event  node_left          [-E0004-0][XX] left
2014-12-23T16:38:39.128-07:00  node_event  node_left          [E0007-1][XX] left
2014-12-23T16:38:39.128-07:00  node_event  elected_as_master  [XX] became master
2014-12-23T16:38:22.445-07:00  node_event  node_left          [E0007-1][XX] left
2014-12-23T16:38:19.298-07:00  node_event  node_left          [E0007-0][XX] left
2014-12-23T16:32:57.804-07:00  node_event  elected_as_master  [XX] became master
2014-12-23T16:32:57.804-07:00  node_event  node_left          [E0012-0][XX] left

All I can find are some INFO logs from when the master is elected, with
"reason: zen-disco-master_failed":

[2014-12-23 17:32:27,668][INFO ][cluster.service          ] [E0007-1] master {new [E0007-1][M8pl6CaVTWi73pWLuOFPfQ][E0007][inet[E0007/xxx]]{rack=E0007, max_local_storage_nodes=2, master=true}, previous [E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012, max_local_storage_nodes=2, master=true}}, removed {[E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012, max_local_storage_nodes=2, master=true},}, reason: zen-disco-master_failed ([E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012, max_local_storage_nodes=2, master=true})

I couldn't find any other errors or warnings around this time. All I can find
are OOM errors, which were also happening before. I found similar logs on all
the nodes just before each node left:

[2014-12-23 17:38:20,117][WARN ][index.translog           ] [E0007-1] [xxxx70246][10] failed to flush shard on translog threshold
org.elasticsearch.index.engine.FlushFailedEngineException: [xxxx10170246][10] Flush failed
        at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:868)
        at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:609)
        at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:201)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2941)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3122)

Also found some transport exceptions, which are not new:

[2014-12-23 17:37:52,328][WARN ][search.action            ] [E0007-1] Failed to send release search context
org.elasticsearch.transport.SendRequestTransportException: [E0012-0][inet[ALLEG-P-E0012/172.16.116.112:9300]][search/freeContext]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:220)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:190)
        at org.elasticsearch.search.action.SearchServiceTransportAction.sendFreeContext(SearchServiceTransportAction.java:125)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(TransportSearchTypeAction.java:348)
        at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:147)
        at org.elasticsearch.action.search.type.TransportSea
The cluster recovered after the 7 minutes and is back up and green. Can these
errors cause nodes to stop responding, making the cluster think a node is
dead, elect a new master, and so forth? If not, I was wondering if I can get
some pointers on where to look, or on what might have happened.
Thanks,
Abhishek
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a751d8d7-aca2-4d05-b058-123cb5323560%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.