Index recovery failure on node restart since v1.3.x

We have a single-node ES instance, which is restarted once a week. Every
time it is restarted, recovery of one specific index always gets stuck at -

[2014-10-06 22:47:48,107][DEBUG][index.translog           ] [testnode] [testindex_20140930][0] interval [5s], flush_threshold_ops [2147483647], flush_threshold_size [200mb], flush_threshold_period [30m]
[2014-10-06 22:47:48,108][DEBUG][index.shard.service      ] [testnode] [testindex_20140930][0] state: [CREATED]->[RECOVERING], reason [from gateway]
[2014-10-06 22:47:48,108][DEBUG][index.gateway            ] [testnode] [testindex_20140930][0] starting recovery from local ...
[2014-10-06 22:47:48,203][DEBUG][index.engine.internal    ] [testnode] [testindex_20140930][0] starting engine

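To see more precisely where it is stuck, the index recovery API (GET <index>/_recovery, available since 1.1) can be polled while the shard sits in this state. A minimal sketch in Python, assuming the node listens on localhost:9200 and using the index name from the log excerpt above:

# check_recovery.py - poll the recovery API for the stuck shard
# (sketch; host and index name are assumptions taken from the logs above)
import json
import urllib.request

ES = "http://localhost:9200"      # assumed HTTP address of the node
INDEX = "testindex_20140930"      # index name from the log excerpt

with urllib.request.urlopen("%s/%s/_recovery?detailed=true" % (ES, INDEX)) as resp:
    recovery = json.load(resp)

# print the recovery stage and translog progress for each shard of the index
for shard in recovery.get(INDEX, {}).get("shards", []):
    print(shard.get("id"), shard.get("stage"), shard.get("translog"))
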
We have to delete that index for recovery to complete. Taking a hot threads
dump, we get the following output -
::: [testnode.node][ff9m9KnRSqWfkrTZiAMbsA][testnode][inet[/10.126.143.197:9301]]{datacenter=nj, master=true}

   102.9% (514.3ms out of 500ms) cpu usage by thread 'elasticsearch[testnode.node][generic][T#2]'
     10/10 snapshots sharing following 14 elements
       org.elasticsearch.index.engine.internal.InternalEngine$SearchFactory.newSearcher(InternalEngine.java:1574)
       org.apache.lucene.search.SearcherManager.getSearcher(SearcherManager.java:160)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:122)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:225)
       org.elasticsearch.index.engine.internal.InternalEngine.refresh(InternalEngine.java:779)
       org.elasticsearch.index.engine.internal.InternalEngine.delete(InternalEngine.java:686)
       org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryOperation(InternalIndexShard.java:780)
       org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:250)
       org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
       java.lang.Thread.run(Thread.java:722)

We started seeing this error after upgrading to v1.3.2, and it is still
happening with v1.3.4. Could someone advise on what might be happening? Thanks.
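
From the stack trace the shard looks busy rather than truly hung: performRecoveryOperation -> InternalEngine.delete -> refresh suggests it is replaying delete operations from the translog and forcing a Lucene refresh for each one, which can make replay of a large translog take a very long time. One possible workaround, assuming the data can be flushed before the weekly bounce, is to flush the indices so the translog is empty and there is little to replay at startup. A minimal sketch, with the host assumed:

# flush_before_restart.py - flush all indices so the translog is (near) empty
# before the scheduled restart (sketch; host assumed, no error handling)
import urllib.request

ES = "http://localhost:9200"   # assumed HTTP address of the node

req = urllib.request.Request("%s/_flush" % ES, data=b"", method="POST")
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())

This would not address whatever changed in 1.3.x, but it should at least shrink the translog that has to be replayed after the restart.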


Why are you restarting the node every week?
That sounds like a problem you should solve, which would stop this one from happening.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

Well, it's a shared resource (not prod), used for other things, and due to
historical/enterprise reasons it's bounced every week. Though not ideal, I
expect ES to be able to restart without issues.

Hi,

I would open an issue on GitHub. Even if it's just one node,
Elasticsearch should be able to restart cleanly.

Thanks,
Thibaut


Thanks. It's difficult to replicate without the data, but I will try to raise it on
GitHub.
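
If it helps with the GitHub issue, the details that usually get asked for are the exact ES and JVM versions, the on-disk size of the shard's translog, and a few hot threads snapshots taken while the recovery is stuck. A minimal sketch for capturing those snapshots, with host, interval, and output file as assumptions:

# capture_hot_threads.py - grab a few hot_threads snapshots while recovery is stuck
# (sketch; host, snapshot count, interval, and output file are assumptions)
import time
import urllib.request

ES = "http://localhost:9200"   # assumed HTTP address of the node

with open("hot_threads.log", "a") as out:   # assumed output file
    for _ in range(5):                      # five snapshots, 10 seconds apart
        with urllib.request.urlopen("%s/_nodes/hot_threads?threads=3" % ES) as resp:
            out.write(resp.read().decode() + "\n---\n")
        time.sleep(10)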
