Rebuilding corrupted index metadata

Hey all,

I thought I would share an experience I had recently in case it helps
anybody having a similar problem:

We are in the process of upgrading from es 0.16.2 to the latest 0.17.7. We
did this by shutting down the cluster and starting up the 0.17.7 instances
pointing to our existing data directories (we use local gateway).

We're not sure how it happened, but when the new cluster came up, it never
left the red state and showed no indices at all. Switching back to 0.16.2
didn't help, and neither did restoring the data directory from a backup.

So, here's what we did:

  1. Delete the data directories completely (we had a backup elsewhere)
  2. Start up a clean es 0.17.7 and wait for green (no indices to wait for,
    of course)
  3. Issue create index / put mapping commands to recreate all our index
    definitions and mappings (same # of shards, same mappings, etc. as
    before) -- a rough sketch of these calls follows this list
  4. Shut down the cluster and copy the data index directories only (not the
    _state directories) over from the backup
  5. Start up the cluster -- all indices came up green and had all our data!
    • Note that es seems to delete any index directories that don't match
      up with existing indices, so make sure you hang on to the backup until
      you are sure
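
As a rough illustration, here's what steps 2 and 3 look like against the
HTTP API. This is only a sketch: "my_index", the example mapping, and the
shard/replica counts are placeholders for your real definitions, and it
assumes a node listening on localhost:9200.

import json
import urllib.request

ES = "http://localhost:9200"  # assumed node address

def request(method, path, body=None):
    # Send a small JSON request to the node and return the parsed response.
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(ES + path, data=data, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Step 2: wait for the (empty) cluster to report green.
print(request("GET", "/_cluster/health?wait_for_status=green&timeout=60s"))

# Step 3: recreate each index with the same number of shards and replicas it
# had before, then put the old mappings back (placeholder values below).
print(request("PUT", "/my_index", {
    "settings": {"number_of_shards": 5, "number_of_replicas": 2}
}))
print(request("PUT", "/my_index/my_type/_mapping", {
    "my_type": {"properties": {"title": {"type": "string"}}}
}))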

On the next environment we tried this on, we first flushed the transaction
logs before shutting down the cluster and upgrading. Everything went
smoothly.
I don't know if flushing had anything to do with it, or if the first problem
was kind of a freak occurrence, but I thought I would mention it.
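
In case it's useful, the flush itself is just the flush API across all
indices. A minimal sketch, assuming a node on localhost:9200:

import urllib.request

# Flush (commit) the transaction logs for every index before shutting the
# cluster down for the upgrade.
req = urllib.request.Request("http://localhost:9200/_flush", data=b"",
                             method="POST")
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))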

Curtis

On Sun, Oct 2, 2011 at 5:59 AM, Shay Banon kimchy@gmail.com wrote:

This method will only work if it ends up with the same shard distribution
across the cluster. Did any upgrade you tried from 0.16.2 to 0.17.7 cause
missing data?

That's a good point. In our case, we are starting with three nodes and two
replicas, so every node has all the shards.
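
A quick way to sanity-check that before relying on it (again just a sketch,
assuming a node on localhost:9200): when every node carries a full copy of
every shard, the active shard count equals active primaries times the number
of data nodes.

import json
import urllib.request

# Read cluster health and check that the assigned shard copies add up to one
# full copy of every shard per data node.
with urllib.request.urlopen("http://localhost:9200/_cluster/health") as resp:
    health = json.loads(resp.read())

full_copy = (health["active_shards"] ==
             health["active_primary_shards"] * health["number_of_data_nodes"])
print("every node has all shards:", full_copy)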

To answer your question:

The first one we tried was a single-node dev instance. It failed and we had
to rebuild the metadata.

The second one we tried was also single-node, and it succeeded without a
problem.

The third one we tried was the three-node production instance. It had the
same failure as the first dev instance.

In the failure cases, there didn't seem to be anything in the logs, even at
trace level, except a "starting" message.

Curtis

On Sun, Oct 2, 2011 at 3:14 PM, Shay Banon kimchy@gmail.com wrote:

Strange regarding the failure... Can you recreate it in some way? I will try
to run an upgrade from 0.16.2 to 0.17.7 myself in different scenarios; it
would help if you could try and pinpoint the steps taken to recreate it.

OK, I'll see what I can do about recreating it.

Curtis

On Mon, 2011-10-03 at 00:14 +0200, Shay Banon wrote:

Strange regarding the failure... Can you recreate it in some way? I will
try to run an upgrade from 0.16.2 to 0.17.7 myself in different scenarios;
it would help if you could try and pinpoint the steps taken to recreate it.

Might this not be the bug that was incorrectly adding delete-by-query to
the translogs?

clint
