ES failed to recover after crash

In nel's case it was a corrupted transaction log. When you run out of disk
space, the last transaction sometimes cannot be fully written to the
transaction log, and recovery then fails. If you see exactly the same
error messages, you can try the following:

  • shut down the elasticsearch cluster
  • find all shards that cannot recover by searching the log file
  • for each shard, move its non-zero length translog file into a temporary
    directory (see explanation below)
  • start the elasticsearch cluster
  • if you see the same messages for other shards, repeat

If you see a message like this:

[2012-06-22 17:36:17,165][WARN ][indices.cluster ] [Cat-Man] [myindex][1] failed to start shard

This means that Elasticsearch cannot recover shard 1 of the index myindex on
the node Cat-Man. If you look in the
data/elasticsearch/nodes/0/indices/myindex/1/translog directory, you will
find files named like translog-123456677899 or
translog-123456677899.recovering. One of them will have non-zero length.
Move it to a temporary directory and try starting the server.
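
To collect the affected shards from a large log file, something like the
following rough sketch may help. It assumes each warning sits on a single
line in the format shown above; the function name find_failed_shards and
the regular expression are illustrative and not part of Elasticsearch.

```python
import re

# Matches the tail of lines such as:
# [2012-06-22 17:36:17,165][WARN ][indices.cluster ] [Cat-Man] [myindex][1] failed to start shard
# Captured groups: node name, index name, shard number.
FAILED_SHARD = re.compile(
    r"\[([^\]]+)\]\s*\[([^\]]+)\]\[(\d+)\] failed to start shard"
)


def find_failed_shards(log_lines):
    """Return a sorted list of unique (node, index, shard) tuples."""
    found = set()
    for line in log_lines:
        match = FAILED_SHARD.search(line)
        if match:
            node, index, shard = match.groups()
            found.add((node, index, int(shard)))
    return sorted(found)
```

You could call it as find_failed_shards(open(path_to_log)), where
path_to_log is wherever your node writes its log.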

The transaction log files that you will be moving out contain your most
recently updated and indexed documents. These updates will be lost as a
result of this operation, but you should be able to recover the rest of
your data.
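
The move itself can be scripted. The following is only a sketch under some
assumptions: the directory layout matches the path shown above (node
ordinal 0), data_dir points at the cluster directory (data/elasticsearch in
the example), and the cluster is already shut down. The function name
move_translogs and the backup layout are made up for illustration.

```python
import shutil
from pathlib import Path


def move_translogs(data_dir, index, shard, backup_dir):
    """Move non-empty translog files of one shard into backup_dir.

    Returns the names of the files that were moved.
    """
    # Assumed layout: <data_dir>/nodes/0/indices/<index>/<shard>/translog
    translog_dir = (
        Path(data_dir) / "nodes" / "0" / "indices" / index / str(shard) / "translog"
    )
    # Keep files from different shards apart in the backup location.
    target = Path(backup_dir) / index / str(shard)
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(translog_dir.glob("translog-*")):
        # Only non-zero length files are moved; empty ones stay in place.
        if path.is_file() and path.stat().st_size > 0:
            shutil.move(str(path), str(target / path.name))
            moved.append(path.name)
    return moved
```

For the example above this would be called as
move_translogs("data/elasticsearch", "myindex", 1, "/tmp/translog-backup").
Moving rather than deleting keeps the files around in case you need to
inspect them later.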

On Thursday, June 28, 2012 6:40:06 AM UTC-4, Clinton Gormley wrote:

Hiya

We are facing an issue with ES. Our ES instance crashed due to low
disk space. After that it throws the exceptions described below.

shard]; nested: MapperParsingException[Failed to parse [Link]];
nested: JsonParseException[Illegal unquoted character ((CTRL-CHAR,
code 0)): has to be escaped using backslash to be included in string
value

I'm guessing here, but it looks like your mapping has been corrupted,
presumably by running out of disk space.

You may be able to fix it by looking in
data/nodes/0/indices/INDEXNAME/_state. But that file is compressed, I'm
not sure what format. Do you have an older copy that you might be able
to use?

And it sounds a lot like you were only running one instance, so there is
no other instance you could copy the mapping from.

clint