Hey all,
I think my cluster just nuked my main index. I noticed queries were not
coming back from the cluster. I happened to have BigDesk open, and saw that
the CPU was pegged on both node machines with load just climbing.
I killed the Java process (which I've done quite a few times before without
it ever causing a failure in tests), but upon restart my logs were full of
this:
[2012-03-07 18:32:25,735][WARN ][cluster.action.shard ] [Suprema]
sending failed shard for [postmark_staging][2],
node[QpuhZarSRlWadRSaes4qAQ], [P], s[INITIALIZING], reason [Failed to start
shard, message [IndexShardGatewayRecoveryException[[postmark_staging][2]
shard allocated for local recovery (post api), should exists, but doesn't]]]
What's more, the cluster would not come up at all, so I couldn't administer
it with the API. The frontend was showing red status, but not actually
giving me any option to act on it. I ended up moving the old node data
aside into a backup, and upon restart ES came back up happy (of course
without the index data…). I've started my index rebuild while I debug the
old data, since I suspect I might have to do this anyway…
Thoughts? Can I recover this?
My system setup:
Java:
min/max RAM is 16 GB (10 GB on the other node)
SurvivorRatio is 6 (I found performance really improved once I tweaked this
up)
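For reference, the heap/GC settings above amount to roughly these JVM flags (how they actually get passed in depends on the startup script, so treat this as a sketch):
-Xms16g -Xmx16g -XX:SurvivorRatio=6
(and -Xms10g -Xmx10g on the other node)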
Current index settings:
"index.number_of_replicas": "0",
"index.number_of_shards": "10",
"index.merge.policy.segments_per_tier": "20",
"index.refresh_interval": "-1"
Now, it's also worth mentioning that we do a fairly high rate of deletes and indexing (all bulk, around 500 every 5-10 seconds). In my previous index settings, I also tweaked the ratio of merge deletes per segment (it's 10 by default; I experimented with upping it to 50 in some cases, and eventually settled on 40).
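The bulk traffic is the standard _bulk API mix of index and delete actions, roughly like this (the type, ids and field names here are made up; each action/document goes on its own line and the body needs a trailing newline):
# type, ids and field names below are made up
curl -XPOST 'http://localhost:9200/_bulk' --data-binary '{ "index": { "_index": "postmark_staging", "_type": "message", "_id": "12345" } }
{ "subject": "...", "body": "..." }
{ "delete": { "_index": "postmark_staging", "_type": "message", "_id": "12344" } }
'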