Shard stuck in STORE TRANSLOG stage

My topbeat index has two shards, 0 and 1, which have both been stuck at the TRANSLOG recovery stage for hours. Cluster state is red. Please help me troubleshoot this problem; here is the recovery status for shard 1:

"id": 1,
  "type": "STORE",
  "stage": "TRANSLOG",
  "primary": true,
  "start_time": "2015-12-29T13:03:04.160Z",
  "start_time_in_millis": 1451394184160,
  "total_time": "1.9h",
  "total_time_in_millis": 6847526,
  "source": {
    "id": "RWxvylvLQcSF4T3RM6xG8A",
    "host": "10.35.132.142",
    "transport_address": "10.35.132.142:9300",
    "ip": "10.35.132.142",
    "name": "ec-dyl09026app03"
  },
  "target": {
    "id": "RWxvylvLQcSF4T3RM6xG8A",
    "host": "10.35.132.142",
    "transport_address": "10.35.132.142:9300",
    "ip": "10.35.132.142",
    "name": "ec-dyl09026app03"
  },
  "index": {
    "size": {
      "total": "296.9mb",
      "total_in_bytes": 311398655,
      "reused": "296.9mb",
      "reused_in_bytes": 311398655,
      "recovered": "0b",
      "recovered_in_bytes": 0,
      "percent": "100.0%"
    },
    "files": {
      "total": 88,
      "reused": 88,
      "recovered": 0,
      "percent": "100.0%"
    },
    "total_time": "26ms",
    "total_time_in_millis": 26,
    "source_throttle_time": "-1",
    "source_throttle_time_in_millis": 0,
    "target_throttle_time": "-1",
    "target_throttle_time_in_millis": 0
  },
  "translog": {
    "recovered": 379391,
    "total": -1,
    "percent": "-1.0%",
    "total_on_start": -1,
    "total_time": "1.9h",
    "total_time_in_millis": 6847500
  },
  "verify_index": {
    "check_index_time": "0s",
    "check_index_time_in_millis": 0,
    "total_time": "0s",
    "total_time_in_millis": 0
  }
}
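For reference, output like the above comes from the indices recovery API; this is roughly how it was pulled (the index name is a placeholder, not the real one):

  # Per-shard recovery detail (the JSON above); "?human" renders durations like "1.9h"
  curl -s 'localhost:9200/<topbeat-index>/_recovery?human&pretty'

  # Compact one-line-per-shard view of recoveries across the cluster
  curl -s 'localhost:9200/_cat/recovery?v'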

Check your ES logs.
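For example (a minimal sketch assuming a default package install; the log path and cluster name are assumptions):

  # Look for recovery/allocation errors in the node log; adjust the path and cluster name
  grep -iE 'exception|recover|allocat' /var/log/elasticsearch/<cluster_name>.log | tail -n 100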

What version are you on?

ES 2.1.0.

I tried many things:

  • Tried closing the index and reopening it; no luck.
  • Tried deleting the index and recreating it; no luck. As soon as the new shards are created, they get stuck.
  • Various shards went unassigned and were never allocated again, even after many hours.
  • I noticed the EC processes consumed 100% CPU continuously for hours, so they must have gone into some infinite loop (see the command sketch after this list).
  • Shards of newly created indices also got stuck in STORE recovery.
  • Tried a rolling restart of the EC nodes; it did not help.
  • Tried stopping all EC nodes and starting them all in one shot; that did not help either.
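A rough sketch of the commands behind the steps above (index and node names are placeholders, not taken from this cluster):

  # Close and reopen the index
  curl -XPOST 'localhost:9200/<topbeat-index>/_close'
  curl -XPOST 'localhost:9200/<topbeat-index>/_open'

  # List shards that are stuck UNASSIGNED
  curl -s 'localhost:9200/_cat/shards?v' | grep UNASSIGNED

  # See what the busy threads are doing while CPU sits at 100%
  curl -s 'localhost:9200/_nodes/hot_threads?threads=5'

  # ES 2.x: try to force-allocate a stuck primary (allow_primary can lose data)
  curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
    "commands": [
      { "allocate": { "index": "<topbeat-index>", "shard": 0,
                      "node": "<node-name>", "allow_primary": true } }
    ]
  }'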

All of this started happening when I tried to close a couple of old indices.
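For context, the close was along these lines (a hedged sketch; the actual index pattern is an assumption):

  # Bulk-closing old daily indices by wildcard -- roughly what preceded the problem
  curl -XPOST 'localhost:9200/topbeat-2015.11.*/_close'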

After many hours of struggle, I finally deleted the data folder. Now everything is running fine.
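For anyone following the same path, this is roughly what that amounted to (the data path is an assumption; check path.data in elasticsearch.yml, and note it wipes every local shard copy):

  # Stop the node, remove the local data directory, start it again
  sudo service elasticsearch stop
  sudo rm -rf /var/lib/elasticsearch/*
  sudo service elasticsearch start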

If you are interested, I can give you the logs. They should reveal how bulk-closing indices can put the cluster into an irrecoverable state.

If you kept track of everything, it may be worth raising an issue on GitHub.