Index recovery failure on node restart since v1.3.x

We have a single-node ES instance, which is restarted once a week. Every
time it is restarted, recovery of one specific index always gets stuck at -

[2014-10-06 22:47:48,107][DEBUG][index.translog           ] [testnode] [testindex_20140930][0] interval [5s], flush_threshold_ops [2147483647], flush_threshold_size [200mb], flush_threshold_period [30m]
[2014-10-06 22:47:48,108][DEBUG][index.shard.service      ] [testnode] [testindex_20140930][0] state: [CREATED]->[RECOVERING], reason [from gateway]
[2014-10-06 22:47:48,108][DEBUG][index.gateway            ] [testnode] [testindex_20140930][0] starting recovery from local ...
[2014-10-06 22:47:48,203][DEBUG][index.engine.internal    ] [testnode] [testindex_20140930][0] starting engine

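To see more precisely where it is stuck, the index recovery API (GET <index>/_recovery, available since 1.1) can be polled while the shard sits in this state. A minimal sketch in Python, assuming the node listens on localhost:9200 and using the index name from the log excerpt above:

# check_recovery.py - poll the recovery API for the stuck shard
# (sketch; host and index name are assumptions taken from the logs above)
import json
import urllib.request

ES = "http://localhost:9200"      # assumed HTTP address of the node
INDEX = "testindex_20140930"      # index name from the log excerpt

with urllib.request.urlopen("%s/%s/_recovery?detailed=true" % (ES, INDEX)) as resp:
    recovery = json.load(resp)

# print the recovery stage and translog progress for each shard of the index
for shard in recovery.get(INDEX, {}).get("shards", []):
    print(shard.get("id"), shard.get("stage"), shard.get("translog"))
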
We have to delete that index for recovery to complete. Taking a hot threads
dump, we get the following output -
::: [testnode.node][ff9m9KnRSqWfkrTZiAMbsA][testnode][inet[/10.126.143.197:9301]]{datacenter=nj, master=true}

   102.9% (514.3ms out of 500ms) cpu usage by thread 'elasticsearch[testnode.node][generic][T#2]'
     10/10 snapshots sharing following 14 elements
       org.elasticsearch.index.engine.internal.InternalEngine$SearchFactory.newSearcher(InternalEngine.java:1574)
       org.apache.lucene.search.SearcherManager.getSearcher(SearcherManager.java:160)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:122)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:225)
       org.elasticsearch.index.engine.internal.InternalEngine.refresh(InternalEngine.java:779)
       org.elasticsearch.index.engine.internal.InternalEngine.delete(InternalEngine.java:686)
       org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryOperation(InternalIndexShard.java:780)
       org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:250)
       org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
       java.lang.Thread.run(Thread.java:722)

We started seeing this error after upgrading to v1.3.2, and it is still
happening with v1.3.4. Could someone advise on what might be happening? Thanks.
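
From the stack trace the shard looks busy rather than truly hung: performRecoveryOperation -> InternalEngine.delete -> refresh suggests it is replaying delete operations from the translog and forcing a Lucene refresh for each one, which can make replay of a large translog take a very long time. One possible workaround, assuming the data can be flushed before the weekly bounce, is to flush the indices so the translog is empty and there is little to replay at startup. A minimal sketch, with the host assumed:

# flush_before_restart.py - flush all indices so the translog is (near) empty
# before the scheduled restart (sketch; host assumed, no error handling)
import urllib.request

ES = "http://localhost:9200"   # assumed HTTP address of the node

req = urllib.request.Request("%s/_flush" % ES, data=b"", method="POST")
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())

This would not address whatever changed in 1.3.x, but it should at least shrink the translog that has to be replayed after the restart.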


Why are you restarting the node every week?
That sounds like a problem you should solve, which would stop this one from happening.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

Well, it's a shared resource (not prod), used for other things, and due to
historical/enterprise reasons it's bounced every week. Though not ideal, I
expect ES to be able to restart without issues.

Hi,

I would open an issue on GitHub. Even if it's just one node,
Elasticsearch should be able to restart cleanly.

Thanks,
Thibaut


Thanks. It's difficult to replicate without the data, but I will try to raise it on
GitHub.
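
If it helps with the GitHub issue, the details that usually get asked for are the exact ES and JVM versions, the on-disk size of the shard's translog, and a few hot threads snapshots taken while the recovery is stuck. A minimal sketch for capturing those snapshots, with host, interval, and output file as assumptions:

# capture_hot_threads.py - grab a few hot_threads snapshots while recovery is stuck
# (sketch; host, snapshot count, interval, and output file are assumptions)
import time
import urllib.request

ES = "http://localhost:9200"   # assumed HTTP address of the node

with open("hot_threads.log", "a") as out:   # assumed output file
    for _ in range(5):                      # five snapshots, 10 seconds apart
        with urllib.request.urlopen("%s/_nodes/hot_threads?threads=3" % ES) as resp:
            out.write(resp.read().decode() + "\n---\n")
        time.sleep(10)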
