Shard error: listed commit_point [commit-bx]/[429], but not all files exists, ignoring

Hi all,

Yesterday afternoon I started to see a cluster red state and the following
exception:

[2013-08-12 23:59:59,587][WARN ][cluster.action.shard ] [hostname]
sending failed shard for [einstein][6], node[w3Ux3oepSrWr5tP_cGWbBQ], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[einstein][6] No commit point data is
available in gateway]]]
[2013-08-12 23:59:59,904][WARN ][index.gateway.s3 ] [hostname]
[einstein][6] listed commit_point [commit-bx]/[429], but not all files
exists, ignoring

[2013-08-12 23:59:59,904][WARN ][indices.cluster ] [hostname]
[einstein][6] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
[einstein][6] No commit point data is available in gateway
at
org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recover(BlobStoreIndexShardGateway.java:427)
at
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

This is the first time it has ever happened. If I actually look at the
stored information for that shard, I see content like this:
s3.gateway.bucket/elasticsearch/indices/einstein/6/... (many files)
s3.gateway.bucket/elasticsearch/indices/einstein/6/__2d6
s3.gateway.bucket/elasticsearch/indices/einstein/6/__2dj
s3.gateway.bucket/elasticsearch/indices/einstein/6/commit-bx

What I also notice is that commit-bx (a) doesn't have an entry for
__2dj, (b) does have an entry for __2di, even though __2di does not appear
in the directory listing, and (c) all other files seem to be accounted for.
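
In case anyone wants to reproduce the comparison, this is roughly the check
I did by hand, written out as a script. It's only a sketch: it assumes boto is
installed and configured, the bucket/prefix are placeholders for my real ones,
and it assumes the commit-* blob parses as JSON (it did for me), so it just
walks the JSON for anything that looks like a file entry.

import json
import boto

BUCKET = "s3.gateway.bucket"                  # placeholder for my real bucket
PREFIX = "elasticsearch/indices/einstein/6/"  # shard 6 of the einstein index
COMMIT = PREFIX + "commit-bx"

conn = boto.connect_s3()
bucket = conn.get_bucket(BUCKET)

# Everything actually stored under the shard prefix (blob names only).
listed = set(key.name.rsplit("/", 1)[-1] for key in bucket.list(prefix=PREFIX))

# Every logical file the commit point refers to. I don't know the exact
# nesting of the commit JSON, so collect any dict that carries
# "physical_name" and "length".
commit = json.loads(bucket.get_key(COMMIT).get_contents_as_string())

def file_entries(node, name=None):
    if isinstance(node, dict):
        if name and "physical_name" in node and "length" in node:
            yield name, node
        for key, value in node.items():
            for item in file_entries(value, key):
                yield item
    elif isinstance(node, list):
        for value in node:
            for item in file_entries(value, name):
                yield item

referenced = dict(file_entries(commit))

print("referenced but missing from S3: %s" % sorted(set(referenced) - listed))
print("in S3 but not referenced: %s" % sorted(
    name for name in listed - set(referenced) if not name.startswith("commit-")))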

Is there any way for me to recover from this state? Or, failing that, a way
to get at the documents in the affected shard, so that I could clear the
index but still have the source docs to re-ingest?
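
To make the second half of that concrete, what I had in mind is something like
the sketch below: walk whatever shards still respond with scan/scroll and dump
_source to a file for later re-ingest. This is only a rough sketch against the
0.90-style scan/scroll API as I understand it, the host and file name are
placeholders, and of course anything on the broken shard wouldn't come back
this way.

import json
import requests

ES = "http://localhost:9200"   # placeholder: whichever node I query
INDEX = "einstein"

# Open a scan-type scroll; "size" is per shard per round trip.
resp = requests.get(
    "%s/%s/_search" % (ES, INDEX),
    params={"search_type": "scan", "scroll": "5m"},
    data=json.dumps({"query": {"match_all": {}}, "size": 100}),
).json()
scroll_id = resp["_scroll_id"]

with open("einstein_source_dump.jsonl", "w") as out:
    while True:
        # The scroll id goes in the request body on this version, as far as I know.
        resp = requests.get(
            "%s/_search/scroll" % ES,
            params={"scroll": "5m"},
            data=scroll_id,
        ).json()
        hits = resp["hits"]["hits"]
        if not hits:
            break
        scroll_id = resp["_scroll_id"]
        for hit in hits:
            out.write(json.dumps(hit["_source"]) + "\n")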

I've seen other cases on the list
(http://elasticsearch-users.115913.n3.nabble.com/No-commit-point-data-is-available-in-gateway-td3727782.html)
that suggest the index now needs to be completely deleted/replaced, but at
least being able to get at the documents would be useful.

Thank you for any pointers; I'll report back if I make any progress,
oli


I actually just managed to get my cluster back to a green state. The
resolution was very odd (I'm still not sure whether the *data* is fully
intact, but the cluster seems happy).

Here's what I observed:

The content length of __2dj was the same as what commit-bx had for the
__2di entry:

__2dj:
1422 s3.gateway.bucket/elasticsearch/indices/einstein/6/__2dj

commit-bx:
"__2di" : {
  "physical_name" : "segments_3h",
  "length" : 1422
},

I guessed that these might be referring to the same file, with the names just
assigned oddly, so I decided to try renaming the __2dj file to __2di to see
if commit-bx would be happier.
When I did, something promptly renamed the file back to __2dj on my behalf,
but also seemed to replace the __2di entry in commit-bx with an equivalent
__2dj entry. It also bumped the version number in commit-bx from 429 to 430.
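
For the record, I did the rename with my usual S3 tooling; since S3 has no
real rename operation, it amounts to a copy followed by a delete, roughly this
in boto (bucket name again a placeholder):

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket("s3.gateway.bucket")   # placeholder
prefix = "elasticsearch/indices/einstein/6/"

# "Rename" __2dj to __2di: copy to the new key, then delete the original.
bucket.copy_key(prefix + "__2di", bucket.name, prefix + "__2dj")
bucket.delete_key(prefix + "__2dj")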

As mentioned, the cluster now seems happy, but I still need to verify whether
there was any corruption or data loss.
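
The sanity check I have in mind is roughly this: cluster health plus a doc
count, to compare against what my upstream source believes it has sent (again
just a sketch, talking to a local node):

import requests

ES = "http://localhost:9200"   # placeholder: whichever node I query

health = requests.get("%s/_cluster/health" % ES).json()
count = requests.get("%s/einstein/_count" % ES).json()

print("cluster status: %s" % health["status"])
print("docs in einstein: %s" % count["count"])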

If anyone has insight into what may have happened, or whether the current
state is OK, I'd be very interested to hear it!

Thanks,
oli


Not for the faint-hearted. Congrats!

Can't help, but this smells like the JVM was unable to use file descriptors
to write things out and failed silently.

Jörg


Thanks Jörg,

To follow up, I confirmed that the index appears untainted. I probably got
lucky in that the index sees extremely low traffic, so even manually
inspecting the files it keeps in the gateway was reasonably tractable.

It's still unclear to me whether the raw docs themselves are written somewhere
I could recover them from (even if I had to rebuild the index, knowing I
hadn't lost any docs would be excellent!). If I find out, I'll post back.

oli
