S3 gateway recovery fails (No commit point data is available in gateway)

Hi everybody,

I have an Elasticsearch cluster with 3 data nodes, holding around
2 million documents (1.5 GB).
The cluster runs on EC2 instances, and each node has 6 GB of RAM committed
to Elasticsearch.

I am using S3 as the index gateway.

It had been working fine for the last 28 days, but suddenly I am getting
exceptions, and the log files on all data nodes are flooded with the
exception messages shown at the end of this mail.

What I have understood so far, along with my questions:

  1. The indices/shards in the S3 bucket appear to be corrupted (when I
    create a new Elasticsearch data node, it is not able to recover from S3
    and fails with the same error message).

  2. Is there any way I could recover the indices in S3?

  3. I still have the indices on my hard drive. How could I push them to S3
    so that a new Elasticsearch data node can recover the indices from S3?

  4. What could have caused the indices in S3 to become corrupted, so that I
    can prevent it in the future? (My assumption was that, although a remote
    gateway like S3 carries a performance hit compared to a local one,
    choosing S3 as the gateway would mean it always holds a good state of the
    indices, and a new Elasticsearch data node could recover from it.)

[2012-09-26 06:48:42,678][WARN ][cluster.action.shard ] [pgossamerv01_slave3] sending failed shard for [pblueprint3221423402385730][4], node[rlqethc0Rr6NRVW-6Mj1gw], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[pblueprint3221423402385730][4] No commit point data is available in gateway]]]
[2012-09-26 06:48:42,693][WARN ][index.gateway.s3 ] [pgossamerv01_slave3] [pblueprint3693375325864359][3] listed commit_point [commit-f]/[15], but not all files exists, ignoring
[2012-09-26 06:48:42,693][WARN ][indices.cluster ] [pgossamerv01_slave3] [pblueprint3693375325864359][3] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [pblueprint3693375325864359][3] No commit point data is available in gateway
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recover(BlobStoreIndexShardGateway.java:427)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
[2012-09-26 06:48:42,694][WARN ][cluster.action.shard ] [pgossamerv01_slave3] sending failed shard for [pblueprint3693375325864359][3], node[rlqethc0Rr6NRVW-6Mj1gw], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[pblueprint3693375325864359][3] No commit point data is available in gateway]]]
ubuntu@ip-10-68-70-193:/var/log/elasticsearch$ ls
pgossamerv01_index_search_slowlog.log pgossamerv01.log
ubuntu@ip-10-68-70-193:/var/log/elasticsearch$ tail pgossamerv01.log
[2012-09-26 06:56:29,796][WARN ][cluster.action.shard ] [pgossamerv01_slave3] sending failed shard for [pblueprint3221423402385730][4], node[rlqethc0Rr6NRVW-6Mj1gw], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[pblueprint3221423402385730][4] No commit point data is available in gateway]]]
[2012-09-26 06:56:30,412][WARN ][index.gateway.s3 ] [pgossamerv01_slave3] [pblueprint3221423402385730][2] listed commit_point [commit-4]/[4], but not all files exists, ignoring
[2012-09-26 06:56:30,413][WARN ][indices.cluster ] [pgossamerv01_slave3] [pblueprint3221423402385730][2] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [pblueprint3221423402385730][2] No commit point data is available in gateway
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recover(BlobStoreIndexShardGateway.java:427)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
[2012-09-26 06:56:30,414][WARN ][cluster.action.shard ] [pgossamerv01_slave3] sending failed shard for [pblueprint3221423402385730][2], node[rlqethc0Rr6NRVW-6Mj1gw], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[pblueprint3221423402385730][2] No commit point data is available in gateway]]]
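
Since the same warnings repeat many times, it may help to reduce the log to the distinct index/shard pairs whose commit point is reported as incomplete. The following is only a rough sketch in Python, not part of Elasticsearch: the function name `incomplete_commit_points` and the `SAMPLE_LOG` constant are made up for illustration, and the two bracketed values after `listed commit_point` are assumed (from the message text alone) to be the commit point's name and a number attached to it. Whitespace is collapsed first so the pattern works whether or not the log lines have been wrapped.

```python
import re
from collections import defaultdict

# Two warnings copied from the log above (hard-wrapped, as in the mail).
# In practice the text would be read from the node's log file instead.
SAMPLE_LOG = """\
[2012-09-26 06:48:42,693][WARN ][index.gateway.s3 ]
[pgossamerv01_slave3] [pblueprint3693375325864359][3] listed commit_point
[commit-f]/[15], but not all files exists, ignoring
[2012-09-26 06:56:30,412][WARN ][index.gateway.s3 ]
[pgossamerv01_slave3] [pblueprint3221423402385730][2] listed commit_point
[commit-4]/[4], but not all files exists, ignoring
"""

# Matches "[index][shard] listed commit_point [name]/[number], but not all files exists".
# The meaning of the two trailing values is an assumption based on the message text.
INCOMPLETE_COMMIT = re.compile(
    r"\[(?P<index>[^\]\s]+)\]\[(?P<shard>\d+)\] listed commit_point "
    r"\[(?P<commit>[^\]]+)\]/\[(?P<number>\d+)\], but not all files exists"
)


def incomplete_commit_points(log_text):
    """Map (index, shard) to the set of (commit name, number) pairs reported incomplete."""
    # Collapse all whitespace so messages split across lines match as one string.
    flat = " ".join(log_text.split())
    found = defaultdict(set)
    for match in INCOMPLETE_COMMIT.finditer(flat):
        key = (match["index"], int(match["shard"]))
        found[key].add((match["commit"], int(match["number"])))
    return dict(found)


if __name__ == "__main__":
    for (index, shard), commits in sorted(incomplete_commit_points(SAMPLE_LOG).items()):
        print(index, shard, sorted(commits))
```

Running this over the full log from each data node should give one line per affected shard, which would make it easier to compare against what is actually present in the S3 bucket.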

--
Sujan
