Index corruption on cluster restart

Hi,

I have upgraded from ES 0.90.1 to 0.90.7 and everything runs fine until I
restart the cluster. I have about 130 shards in 23 indices, and I am running
on Debian with Java 7, three nodes in the cluster. Most of the time when I
stop and restart the cluster, at least one shard will not come up and throws
exceptions in the logs indicating it is corrupted; the most recent case, for
instance, failed for this reason:

[2013-12-11 11:33:19,544][WARN ][indices.cluster] [Base] [1millionnew][3] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [1millionnew][3] failed to fetch index version after copying it over
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:136)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:174)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [1millionnew][3] shard allocated for local recovery (post api), should exist, but doesn't, current files:

..... a long list of files (my shards are quite big)

        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:115)
        ... 4 more
Caused by: java.io.FileNotFoundException: segments_azw
        at org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:456)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:318)
        at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:380)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:812)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:663)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:376)
        at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:111)
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:106)
        ... 4 more

In this particular case the index was inactive (no search and no indexing at
the time, and for quite a considerable amount of time before that), so I
would say the shards are failing randomly. I have checked the open files
limit, which is 65000 on all nodes for all users. So (1) I am wondering why
the shards are failing in this particular case, and (2) how can I fix the
problem of the missing segments_N file? The shard is striped across 4 disks,
and by looking at other shards of other indices I could not find a pattern in
which stripes the segments files end up on.


Hi,

You can probably set this in elasticsearch.yml so that each shard's index is
checked during node startup:

index.shard.check_on_startup: true

Enable DEBUG logging on index.shard.service to read the output (a sketch of
both settings follows).
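
A minimal sketch of how the two settings might look; the file locations and
the logging.yml layout are assumptions based on a default 0.90 install, so
double-check them against your own config:

# config/elasticsearch.yml
# check every shard of every index when the node starts
index.shard.check_on_startup: true

# config/logging.yml -- add the logger under the existing "logger:" section
logger:
  index.shard.service: DEBUG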

Then start Elasticsearch on the server. I am not sure whether the setting can
also be set to "fix" so that it will actually repair the index. NOTE: I have
not tested any of this, and you should really try it in a test environment
first to avoid any unhappy situation.

The check is implemented in InternalIndexShard:
https://github.com/elasticsearch/elasticsearch/blob/v0.90.7/src/main/java/org/elasticsearch/index/shard/service/InternalIndexShard.java

if ("fix".equalsIgnoreCase(checkIndexOnStartup)) {
if (logger.isDebugEnabled()) {
logger.debug("fixing index, writing new segments file ...");
}
checkIndex.fixIndex(status);
if (logger.isDebugEnabled()) {
logger.debug("index fixed, wrote new segments file "{}"",
status.segmentsFileName);
}
} else {
// only throw a failure if we are not going to fix the index
if (throwException) {
throw new IndexShardException(shardId, "index check failure");
}
}

HTH

/Jason


Hi,

That gives a partial answer to one of the symptoms of the problem, the
missing segments_N file. I will have a look into the code to see whether I
can use the same mechanism to fix such a corrupted shard manually instead of
enabling it as a start-up check, because the check takes ages and, from
reading through the threads, a corrupted shard is usually just deleted when
it is encountered, which I would like to avoid.
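
To make that concrete, this is a rough sketch of the kind of manual fix I
have in mind, running Lucene's CheckIndex directly against a shard's index
directory, with the node stopped and the Lucene core jar shipped with 0.90.7
on the classpath. The shard path is only an illustration of the local-gateway
layout and would need to be adjusted, and since fixIndex drops any segments
it cannot read, the shard directory should be backed up first:

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class FixShard {
    public static void main(String[] args) throws Exception {
        // hypothetical shard path -- adjust cluster name, node ordinal,
        // index name and shard number to the actual layout on disk
        File shardIndexDir = new File(
                "/var/lib/elasticsearch/mycluster/nodes/0/indices/1millionnew/3/index");

        FSDirectory dir = FSDirectory.open(shardIndexDir);
        try {
            CheckIndex checkIndex = new CheckIndex(dir);
            checkIndex.setInfoStream(System.out); // print per-segment check details

            CheckIndex.Status status = checkIndex.checkIndex();
            if (status.clean) {
                System.out.println("shard is clean, nothing to do");
            } else {
                // same call the on-startup "fix" path uses: writes a new
                // segments_N file, dropping any segments that could not be read
                checkIndex.fixIndex(status);
                System.out.println("wrote new segments file " + status.segmentsFileName);
            }
        } finally {
            dir.close();
        }
    }
}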

On the other hand, I am seeing more and more shards corrupted with missing
files; at the moment those are *.si files (segment info files), small
metadata files, but if one is missing, the whole shard won't come up. So I am
really after the underlying problem that is causing the loss of these
metadata files on restart.

I am also having a very hard time restarting the cluster: I am using
out-of-the-box zen discovery settings (sketched below), and when I restart
the cluster, most of the time the nodes won't join into one cluster (each of
them elects itself as master). I have to restart the nodes individually
several times before they join the same cluster. And once they have joined,
if one of them is restarted (I am using bigdesk and head to watch the
cluster), I see the newly started node next to the old one, so the cluster
appears to have N+1 nodes. That is something that was not happening at all
before the upgrade.
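
For reference, a sketch of what pinning the zen discovery settings explicitly
for three nodes might look like, instead of leaving them at their defaults;
the host names are placeholders, and I do not know yet whether this would
change the behaviour I am seeing:

# elasticsearch.yml, on every node
discovery.zen.minimum_master_nodes: 2           # majority of 3 master-eligible nodes
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["node1", "node2", "node3"]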

S.
