Did ES just nuke my index?


(orenmazor) #1

Hey all,

I think my cluster just nuked my main index. I noticed queries were not
coming back from the cluster. I happened to have BigDesk open, and the
CPU was pegged on both node machines, with load climbing steadily.

I killed the Java process (which I've done quite a few times before
without it ever causing a failure in testing), but upon restart my logs
were full of this:

[2012-03-07 18:32:25,735][WARN ][cluster.action.shard ] [Suprema]
sending failed shard for [postmark_staging][2],
node[QpuhZarSRlWadRSaes4qAQ], [P], s[INITIALIZING], reason [Failed to start
shard, message [IndexShardGatewayRecoveryException[[postmark_staging][2]
shard allocated for local recovery (post api), should exists, but doesn't]]]

What's more, the cluster would not come up at all, so I couldn't
administer it via the API. The frontend showed red status but gave me no
way to act on it. I ended up moving the old node data aside as a backup,
and upon restart ES came back up happy (without the index data, of
course…). I've started rebuilding the index while I debug the old data,
since I suspect I'd have to do that anyway…
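
(Roughly, the move-aside step was along these lines; the paths are
illustrative, since the actual data directory depends on the install:)

    # stop the node first (kill the java process or use the service script)
    # data lives under $ES_HOME/data by default, unless path.data says otherwise
    mv /opt/elasticsearch/data /opt/elasticsearch/data.bak
    # on restart, ES starts clean with an empty data directory
    /opt/elasticsearch/bin/elasticsearch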

Thoughts? Can I recover this?

My system setup:

Java:
min/max RAM is 16 GB (10 GB on the other node)
SurvivorRatio is 6 (I found performance really improved once I tweaked
this up)
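
(For reference, those JVM settings map onto the startup environment
roughly like this; the variable names follow the stock
elasticsearch.in.sh of that era, so treat this as a sketch:)

    # pin min and max heap to the same size (16g here; 10g on the smaller node)
    export ES_MIN_MEM=16g
    export ES_MAX_MEM=16g
    # extra HotSpot flag controlling the eden/survivor space ratio
    export JAVA_OPTS="$JAVA_OPTS -XX:SurvivorRatio=6"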

Current index settings:

    "index.number_of_replicas": "0",
    "index.number_of_shards": "10",
    "index.merge.policy.segments_per_tier": "20",
    "index.refresh_interval": "-1"

Now, it's also worth mentioning that we do a fairly high rate of deletes
and indexing (all bulk, around 500 every 5-10 seconds). In my previous
index settings I also tweaked the ratio of merge deletes per segment
(it's 10 by default; I experimented with upping it to 50 in some cases
and eventually settled on 40).
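
(Guessing at the exact knob here: with the tiered merge policy this
sounds like index.merge.policy.expunge_deletes_allowed, a
delete-percentage threshold that defaults to 10. If so, the tweak is
just one more entry among the creation-time settings above:)

    "index.merge.policy.expunge_deletes_allowed": "40"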


(Shay Banon) #2

Which version are you using?


(orenmazor) #3

Hey Shay,

I'm running 0.18.6. After some Googling I saw there had been issues in
previous versions, but those were closed, so I assumed this wasn't
related.
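
(For reference, the running version shows up in the root endpoint
response; a minimal check, assuming a node listening on localhost:9200:)

    curl -XGET 'http://localhost:9200'
    # the response includes a "version" object, e.g. "number": "0.18.6"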


(Shay Banon) #4

There was a bug fixed in 0.19 that might cause this in very rare cases (the cluster state one).
