Cascading cluster failure

Hi all,

We recently had a cascading cluster failure. From 16:35 to 16:42 the
cluster went red and then recovered itself. I can't seem to find any obvious
logs around this time.

The cluster has about 19 nodes: 9 physical boxes each running two instances
of Elasticsearch, plus one VM as a balancer for indexing. CPU is normal and
memory usage is below 75%.

https://lh6.googleusercontent.com/-LxiBa8_BUhk/VJqaEowJpyI/AAAAAAAABVc/eiv930wrrrs/s1600/heap_outage.png

Heap during the outage

https://lh3.googleusercontent.com/-es_kSoeeK3o/VJqaKzQdEiI/AAAAAAAABVk/l4Il0byIORc/s1600/heap_stable.png

Heap once stable.

https://lh6.googleusercontent.com/-pZV1Js-H0Uw/VJqa79NMvYI/AAAAAAAABVs/saudhOu3Vbw/s1600/cluster_overview.png

Below is the list of events that happened according to Marvel:

2014-12-23T16:41:22.456-07:00  node_event  node_joined        [E0009-1][XX] joined
2014-12-23T16:41:19.439-07:00  node_event  node_left          [E0009-1][XX] left
2014-12-23T16:41:19.439-07:00  node_event  elected_as_master  [E0011-0][XX] became master
2014-12-23T16:41:04.392-07:00  node_event  node_joined        [E0007-0][XX] joined
2014-12-23T16:40:49.176-07:00  node_event  node_joined        [E0007-1][XX] joined
2014-12-23T16:40:07.781-07:00  node_event  node_left          [E0007-1][XX] left
2014-12-23T16:40:07.781-07:00  node_event  elected_as_master  [E0010-0][XX] became master
2014-12-23T16:39:51.802-07:00  node_event  node_left          [E0011-1][XX] left
2014-12-23T16:39:05.897-07:00  node_event  node_left          [E0004-0][XX] left
2014-12-23T16:38:39.128-07:00  node_event  node_left          [E0007-1][XX] left
2014-12-23T16:38:39.128-07:00  node_event  elected_as_master  [XX] became master
2014-12-23T16:38:22.445-07:00  node_event  node_left          [E0007-1][XX] left
2014-12-23T16:38:19.298-07:00  node_event  node_left          [E0007-0][XX] left
2014-12-23T16:32:57.804-07:00  node_event  elected_as_master  [XX] became master
2014-12-23T16:32:57.804-07:00  node_event  node_left          [E0012-0][XX] left

All I can find are some INFO logs from when the master is elected, with
"reason: zen-disco-master_failed":

[2014-12-23 17:32:27,668][INFO ][cluster.service ] [E0007-1]
master {new
[E0007-1][M8pl6CaVTWi73pWLuOFPfQ][E0007][inet[E0007/xxx]]{rack=E0007,
max_local_storage_nodes=2, master=true}, previous
[E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012,
max_local_storage_nodes=2, master=true}}, removed
{[E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012,
max_local_storage_nodes=2, master=true},}, reason: zen-disco-master_failed
([E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012,
max_local_storage_nodes=2, master=true})

I couldn't find any other errors or warnings around this time. All I can
find are OOM errors, which were also happening before the outage.

I found similar logs on all the nodes just before they left:

[2014-12-23 17:38:20,117][WARN ][index.translog ] [E0007-1] [xxxx70246][10] failed to flush shard on translog threshold
org.elasticsearch.index.engine.FlushFailedEngineException: [xxxx10170246][10] Flush failed
    at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:868)
    at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:609)
    at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:201)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2941)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3122)

Also, I found some transport exceptions, which are not new:

[2014-12-23 17:37:52,328][WARN ][search.action ] [E0007-1] Failed to send release search context
org.elasticsearch.transport.SendRequestTransportException: [E0012-0][inet[ALLEG-P-E0012/172.16.116.112:9300]][search/freeContext]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:220)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:190)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendFreeContext(SearchServiceTransportAction.java:125)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(TransportSearchTypeAction.java:348)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:147)
    at org.elasticsearch.action.search.type.TransportSea

The cluster recovered after the 7 minutes and is back up and green. Can
these errors cause a node to stop responding, making the cluster think the
node is dead, elect a new master, and so forth? If not, I was wondering if
I can get some pointers on where to look, or on what might have happened.

Thanks,

Abhishek


Hi,

What is the memory on each of these machines?
Also, see if there is any correlation between garbage collection and the
time this anomaly happens. Chances are that a stop-the-world pause might
block the pings for some time, and the cluster might decide some nodes are
gone.
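
For example, something along these lines (just a minimal sketch: it assumes
the REST API is reachable on localhost:9200 and uses the ES 1.x
/_nodes/stats/jvm endpoint; the exact JSON field names may differ slightly
between versions) dumps per-node JVM/GC stats that you can line up against
the Marvel timeline:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class GcStatsDump {
    public static void main(String[] args) throws Exception {
        // Node stats, limited to the JVM section.
        URL url = new URL("http://localhost:9200/_nodes/stats/jvm");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            // Per node, compare jvm.gc.collectors.old.collection_count and
            // collection_time_in_millis between polls; a big jump in old-gen
            // collection time around 16:35-16:42 would point at a long
            // stop-the-world pause.
            System.out.println(line);
        }
        in.close();
        conn.disconnect();
    }
}

Diffing two snapshots, or polling every minute or so, should be enough to
see whether a long old-gen collection lines up with the node_left events.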

Thanks
Vineeth

On Wed, Dec 24, 2014 at 4:23 PM, Abhishek Andhavarapu <abhishek376@gmail.com> wrote:


Thanks for reading, Vineeth. That was my initial thought, but I couldn't find
any old-gen GC during the outage. Each ES node has a 32 GB heap; each box has
128 GB split between two ES nodes (32 GB each) and the filesystem cache (64 GB).

On Wed, Dec 24, 2014 at 4:49 PM, vineeth mohan <vm.vineethmohan@gmail.com> wrote:


You should drop your heap to 31GB; over that you lose some performance and
some effective heap space due to uncompressed pointers.

It looks like a node, or nodes, dropped out due to GC. How much data and how
many indices do you have? What ES and Java versions?

On 24 December 2014 at 22:29, Abhishek <abhishek376@gmail.com> wrote:


On Wed, Dec 24, 2014 at 2:03 PM, Mark Walkom <markwalkom@gmail.com> wrote:

You should drop your heap to 31GB; over that you lose some performance and
some effective heap space due to uncompressed pointers.

I believe the magic number is < 32GB:
http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html#compressedOop

I wonder if there is a way to check whether the JVM actually kept the setting
enabled, somehow, from JMX or jstat or something.
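
(One way that might work, as a rough sketch: HotSpot exposes its VM flags
through the HotSpotDiagnostic MXBean, so you can read the flag directly from
inside the JVM; for a running node the same bean should be reachable over a
remote JMX connection as com.sun.management:type=HotSpotDiagnostic, and
jinfo -flag UseCompressedOops <pid> on the box should report the same thing.
This assumes a HotSpot JVM.)

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class CheckCompressedOops {
    public static void main(String[] args) {
        // HotSpot-specific diagnostic MXBean that lets you read VM flags.
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // HotSpot silently turns this flag off once -Xmx goes past ~32GB.
        System.out.println("UseCompressedOops = "
                + hotspot.getVMOption("UseCompressedOops").getValue());
    }
}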

Nik


Mark,

I work on the cluster as well, so I can answer the size/makeup questions.
Data: 580GB
Shards: 10K
Indices: 347
ES version: 1.3.2

Not sure the Java version.

Thanks for getting back!

pat

On Wednesday, December 24, 2014 12:04:03 PM UTC-7, Mark Walkom wrote:


That's a pretty big number of shards, why is it so high?
The recommendation there is one shard per node (per index), so you should
(ideally) have closer to 6,600 shards (347 indices × 19 nodes ≈ 6,600).

On 25 December 2014 at 07:07, Pat Wright <sqlasylum@gmail.com> wrote:


Mark,

Thanks for reading. Our heap sizes are less than 32 GB to avoid uncompressed
pointers. We ideally double our cluster every year, so the number of shards
is planned for future growth and for the way documents are spread across all
the nodes in the cluster, etc.

Thanks,
Abhishek

Hello Abhishek,

Can you try to correlate shard merge operations with the time of the
cascading failures? I feel there is a correlation between the two.
If so, we can do some optimization on that side.
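
One way to check that, assuming the default REST port, is to snapshot the
merge counters from the indices stats API around the incident window:

curl -s 'localhost:9200/_stats/merge?pretty'
# watch _all.total.merges.current and total_time_in_millis; a spike in
# "current" merges between 16:35 and 16:42 would support the theory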

Thanks
Vineeth

Thanks Vineeth. I was thinking about that, but these merge errors were also happening before the outage, and the elasticsearch process never died. At the time of the outage we were serving about 3 million requests per second. Could the sheer number of requests have overwhelmed the network layer? Everything recovered within 7 minutes.
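
If the suspicion is that busy nodes simply stopped answering pings long
enough to be dropped, the zen fault-detection settings are what control that
threshold; the values below are just the documented 1.x defaults, listed for
reference:

discovery.zen.fd.ping_interval: 1s   # how often each node is pinged
discovery.zen.fd.ping_timeout: 30s   # how long to wait for a ping reply
discovery.zen.fd.ping_retries: 3     # consecutive failures before the node is dropped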

Sent from my iPhone

Hi,

That could be a good reason.
But that shouldn't happen unless you have changed the threadpool settings
for index.
If there is more load than a node can process, requests go to the queue,
and once the queue (20 per node by default) fills up, further requests are
rejected.
Can you check whether there were any rejected index requests during the
incident?
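
A quick way to check that from any node, assuming the default port, is the
cat thread pool API, which prints active, queued and rejected counts per
node:

curl -s 'localhost:9200/_cat/thread_pool?v'
# index.rejected / bulk.rejected / search.rejected are cumulative since node
# start, so compare readings taken before and after the incident window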

But again, that shouldn't matter if you haven't changed the threadpool settings.

Thanks
Vineeth

3 million requests a second!

Can you provide some details on your cluster, i.e. node types and counts?
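
For example, the output of the cat nodes API should cover most of that:

curl -s 'localhost:9200/_cat/nodes?v'
# shows host, heap.percent, ram.percent, load, node.role and which node is
# the current master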

Mark,

We have 18 data nodes across 9 physical servers; each server has 128GB of
RAM and 40 cores. All nodes are currently data nodes and all are
master-eligible. We are getting ready to change this, however, and put in 3
dedicated master nodes.
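
A minimal sketch of that split in elasticsearch.yml, assuming the usual
majority rule for master-eligible nodes, would be:

# on the 3 dedicated master nodes
node.master: true
node.data: false

# on the 18 data nodes
node.master: false
node.data: true

# on every node, once only the 3 masters are master-eligible
discovery.zen.minimum_master_nodes: 2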

-Kris

On Thursday, December 25, 2014 5:01:44 PM UTC-7, Mark Walkom wrote:

3 million requests a second!

Can you provide some details on your cluster, ie node type count?

On 26 December 2014 at 05:14, vineeth mohan <vm.vine...@gmail.com
<javascript:>> wrote:

Hi ,

That could be a good reason.
But then it wont happen without you change the threadpool settings for
index.
If there is load more than it can process , it will go to the queue.
And the queue is by default 20 per node also goes full the requests get
rejected.
Can you see if there is any rejected index requests in the process.

But again , that wont help if you haven't changes the threadpool settings.

Thanks
Vineeth

On Thu, Dec 25, 2014 at 11:39 PM, Abhishek <abhis...@gmail.com
<javascript:>> wrote:

Thanks Vineeth. I was thinking about that but these merge errors were
also happening before the outage. Also the elasticsearch process was never
dead. Also I was wondering at the time of outage we had about 3 million
requests per second. Just the large number of requests caused the network
layer go crazy ? Because everything recovered in 7 minutes.

Sent from my iPhone

On Dec 25, 2014, at 11:12 PM, vineeth mohan <vm.vine...@gmail.com
<javascript:>> wrote:

Hello Abhishek ,

Can you try to correlate merge operation of shards and this time of
cascading failures ?
I feel there is a correlation between both.
If so , we can do some optimization on that side.

Thanks
Vineeth

On Thu, Dec 25, 2014 at 8:53 AM, Abhishek Andhavarapu <
abhis...@gmail.com <javascript:>> wrote:

Mark,

Thanks for reading. Our heap sizes are less than 32 gigs to avoid
uncompressed pointers. We ideally double our cluster every year the number
of shards is plan for future growth. And the way documents are spread
across all the nodes in the cluster etc..

Thanks,
Abhishek

On Thursday, December 25, 2014 2:05:22 AM UTC+5:30, Mark Walkom wrote:

That's a pretty big number of shards, why is it so high?
The usual recommendation is one shard per index per node, so with 347
indices on 19 nodes you should (ideally) have closer to 6,600 shards.
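
If most of those shards come from defaults on newly created indices, an
index template is the usual way to bring the count down going forward; a
sketch (the template name and the "logs-*" pattern are placeholders, and
the shard count should be sized against your own node count and growth
plan):

curl -XPUT 'http://localhost:9200/_template/shards_per_index' -d '{
  "template": "logs-*",
  "settings": {
    "index.number_of_shards": 9,
    "index.number_of_replicas": 1
  }
}'
# 9 primaries + 1 replica each = 18 shard copies per index, roughly one per
# data node; already-created indices keep whatever shard count they have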

On 25 December 2014 at 07:07, Pat Wright sqla...@gmail.com wrote:

Mark,

I work on the cluster as well, so I can answer the size/makeup questions.
Data: 580GB
Shards: 10K
Indices: 347
ES version: 1.3.2

Not sure the Java version.

Thanks for getting back!

pat

On Wednesday, December 24, 2014 12:04:03 PM UTC-7, Mark Walkom wrote:

You should drop your heap to 31GB; over that you lose compressed object
pointers, which costs you some performance and effective heap space.

It looks like a node, or nodes, dropped out due to GC. How much data and
how many indexes do you have? What ES and Java versions?
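
For completeness, on 1.x the heap is normally set through the ES_HEAP_SIZE
environment variable before the node starts (the exact file depends on how
the package was installed):

# /etc/default/elasticsearch (deb) or /etc/sysconfig/elasticsearch (rpm)
ES_HEAP_SIZE=31g

# then verify what each node actually got:
curl 'http://localhost:9200/_nodes/stats/jvm?pretty'
# check jvm.mem.heap_max_in_bytes on every node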

On 24 December 2014 at 22:29, Abhishek abhis...@gmail.com wrote:

Thanks for reading, Vineeth. That was my initial thought, but I couldn't
find any old-gen GC during the outage. Each ES node has 32 gigs. Each box
has 128 gigs split between 2 ES nodes (32G each) and the file system
cache (64G).

On Wed, Dec 24, 2014 at 4:49 PM, vineeth mohan <
vm.vine...@gmail.com> wrote:

Hi,

What is the memory on each of these machines?
Also see if there is any correlation between garbage collection and the
time this anomaly happens.
Chances are that a stop-the-world pause blocks the pings for a while and
the cluster decides some nodes are gone.

Thanks
Vineeth
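
For context, the knobs that decide how quickly a long GC pause turns into a
"node left" are the zen fault-detection settings; the values below are, as
far as I recall, the 1.x defaults (raise them cautiously, since longer
timeouts also delay detection of genuinely dead nodes):

# elasticsearch.yml -- zen fault detection
discovery.zen.fd.ping_interval: 1s    # how often master and nodes ping each other
discovery.zen.fd.ping_timeout: 30s    # how long each ping may take
discovery.zen.fd.ping_retries: 3      # failed pings in a row before the node is dropped

With these defaults a node has to stay unresponsive for well over a minute
to be dropped, so a pause long enough to trigger this should also show up
in the GC logs.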

On Wed, Dec 24, 2014 at 4:23 PM, Abhishek Andhavarapu <
abhis...@gmail.com> wrote:

Hi all,

We recently had a cascading cluster failure. From 16:35 to 16:42
the cluster went red and recovered itself. I can't seem to find any
obvious logs around this time.

The cluster has about 19 nodes: 9 physical boxes each running two
instances of elasticsearch, and one VM as a balancer for indexing. CPU
is normal and memory usage is below 75%.

https://lh6.googleusercontent.com/-LxiBa8_BUhk/VJqaEowJpyI/AAAAAAAABVc/eiv930wrrrs/s1600/heap_outage.png

Heap during the outage

https://lh3.googleusercontent.com/-es_kSoeeK3o/VJqaKzQdEiI/AAAAAAAABVk/l4Il0byIORc/s1600/heap_stable.png

Heap once stable.

https://lh6.googleusercontent.com/-pZV1Js-H0Uw/VJqa79NMvYI/AAAAAAAABVs/saudhOu3Vbw/s1600/cluster_overview.png

Below is the list of events that happened according to Marvel:

2014-12-23T16:41:22.456-07:00  node_event  node_joined        [E0009-1][XX] joined

2014-12-23T16:41:19.439-07:00  node_event  node_left          [E0009-1][XX] left

2014-12-23T16:41:19.439-07:00  node_event  elected_as_master  [E0011-0][XX] became master

2014-12-23T16:41:04.392-07:00  node_event  node_joined        [E0007-0][XX] joined

2014-12-23T16:40:49.176-07:00  node_event  node_joined        [E0007-1][XX] joined

2014-12-23T16:40:07.781-07:00  node_event  node_left          [E0007-1][XX] left

2014-12-23T16:40:07.781-07:00  node_event  elected_as_master  [E0010-0][XX] became master

2014-12-23T16:39:51.802-07:00  node_event  node_left          [E0011-1][XX] left

2014-12-23T16:39:05.897-07:00  node_event  node_left          [-E0004-0][XX] left

2014-12-23T16:38:39.128-07:00  node_event  node_left          [E0007-1][XX] left

2014-12-23T16:38:39.128-07:00  node_event  elected_as_master  [XX] became master

2014-12-23T16:38:22.445-07:00  node_event  node_left          [E0007-1][XX] left

2014-12-23T16:38:19.298-07:00  node_event  node_left          [E0007-0][XX] left

2014-12-23T16:32:57.804-07:00  node_event  elected_as_master  [XX] became master

2014-12-23T16:32:57.804-07:00  node_event  node_left          [E0012-0][XX] left

All I can find are some INFO logs from when the master is elected,
with "reason: zen-disco-master_failed":

[2014-12-23 17:32:27,668][INFO ][cluster.service ] [E0007-1]
master {new
[E0007-1][M8pl6CaVTWi73pWLuOFPfQ][E0007][inet[E0007/xxx]]{rack=E0007,
max_local_storage_nodes=2, master=true}, previous
[E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012,
max_local_storage_nodes=2, master=true}}, removed
{[E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012,
max_local_storage_nodes=2, master=true},}, reason: zen-disco-master_failed
([E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012,
max_local_storage_nodes=2, master=true})

I couldn't find any other errors or warnings around this time. All I
can find are OOM errors, which were also happening before the outage.

I found similar logs on all the nodes just before each node left:

[2014-12-23 17:38:20,117][WARN ][index.translog ] [E0007-1]
[xxxx70246][10] failed to flush shard on translog threshold

org.elasticsearch.index.engine.FlushFailedEngineException:
[xxxx10170246][10] Flush failed

at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:868)

at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:609)

at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:201)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.IllegalStateException: this writer hit an
OutOfMemoryError; cannot commit

at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2941)

at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3122)

Also found some transport exceptions, which are not new:

[2014-12-23 17:37:52,328][WARN ][search.action ] [E0007-1]
Failed to send release search context

org.elasticsearch.transport.SendRequestTransportException:
[E0012-0][inet[ALLEG-P-E0012/172.16.116.112:9300]][search/freeContext]

at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:220)

at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:190)

at org.elasticsearch.search.action.SearchServiceTransportAction.sendFreeContext(SearchServiceTransportAction.java:125)

at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(TransportSearchTypeAction.java:348)

at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:147)

at org.elasticsearch.action.search.type.TransportSea

The cluster recovered after the 7 minutes and is back up and green. Can
these errors cause nodes to stop responding, making the cluster think a
node is dead, elect a new master, and so forth? If not, I was wondering
if I could get some pointers on where to look, or on what might have
happened.

Thanks,

Abhishek
