Shard stuck in STORE TRANSLOG stage


(Omar Al Zabir) #1

My topbeat index has two shards, 0 and 1, which have both been stuck in the TRANSLOG stage for hours. Cluster state is red. Please help me troubleshoot this problem:

"id": 1,
  "type": "STORE",
  "stage": "TRANSLOG",
  "primary": true,
  "start_time": "2015-12-29T13:03:04.160Z",
  "start_time_in_millis": 1451394184160,
  "total_time": "1.9h",
  "total_time_in_millis": 6847526,
  "source": {
    "id": "RWxvylvLQcSF4T3RM6xG8A",
    "host": "10.35.132.142",
    "transport_address": "10.35.132.142:9300",
    "ip": "10.35.132.142",
    "name": "ec-dyl09026app03"
  },
  "target": {
    "id": "RWxvylvLQcSF4T3RM6xG8A",
    "host": "10.35.132.142",
    "transport_address": "10.35.132.142:9300",
    "ip": "10.35.132.142",
    "name": "ec-dyl09026app03"
  },
  "index": {
    "size": {
      "total": "296.9mb",
      "total_in_bytes": 311398655,
      "reused": "296.9mb",
      "reused_in_bytes": 311398655,
      "recovered": "0b",
      "recovered_in_bytes": 0,
      "percent": "100.0%"
    },
    "files": {
      "total": 88,
      "reused": 88,
      "recovered": 0,
      "percent": "100.0%"
    },
    "total_time": "26ms",
    "total_time_in_millis": 26,
    "source_throttle_time": "-1",
    "source_throttle_time_in_millis": 0,
    "target_throttle_time": "-1",
    "target_throttle_time_in_millis": 0
  },
  "translog": {
    "recovered": 379391,
    "total": -1,
    "percent": "-1.0%",
    "total_on_start": -1,
    "total_time": "1.9h",
    "total_time_in_millis": 6847500
  },
  "verify_index": {
    "check_index_time": "0s",
    "check_index_time_in_millis": 0,
    "total_time": "0s",
    "total_time_in_millis": 0
  }
}
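
In case it helps with reproduction: the output above is from the recovery API. Here is a minimal sketch, assuming Python 3 and a cluster reachable at localhost:9200 (the index name topbeat is a placeholder; adjust both for your setup), that polls that API and flags shard copies that are not yet DONE:

import json
import urllib.request

# Placeholder endpoint; adjust host/port and index name for your cluster.
URL = "http://localhost:9200/topbeat/_recovery?detailed=true"

with urllib.request.urlopen(URL) as resp:
    recovery = json.load(resp)

# The response maps each index name to a list of per-shard recovery records;
# print the stage and elapsed time for every shard copy not yet DONE.
for index, info in recovery.items():
    for shard in info["shards"]:
        if shard["stage"] != "DONE":
            print(index, "shard", shard["id"], shard["stage"], shard["total_time"])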

(Mark Walkom) #2

Check your ES logs.

What version are you on?
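
If you don't have the version handy, a quick sketch for grabbing it along with the overall cluster health (assuming the default HTTP port on localhost; the root endpoint reports the node's version number):

import json
import urllib.request

# Assumes localhost:9200; adjust for your cluster.
for path in ("/", "/_cluster/health"):
    with urllib.request.urlopen("http://localhost:9200" + path) as resp:
        print(path, "->", json.dumps(json.load(resp), indent=2))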


(Omar Al Zabir) #3

ES 2.1.0.

I tried many things:

  • Tried closing the index and reopening it (along the lines of the sketch after this list), with no luck.
  • Tried deleting the index and recreating it, with no luck. As soon as the new shards are created, they get stuck.
  • Various shards went into the unassigned state and never got assigned, even after many hours.
  • I noticed the ES process consuming 100% CPU continuously for hours, so it must have gone into some infinite loop.
  • New indices started getting stuck in STORE recovery as well.
  • Tried a rolling restart of the ES nodes; it did not help.
  • Tried stopping all the ES nodes and starting them again in one shot; that did not help either.
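
For reference, the close/reopen attempts were along these lines; a minimal sketch using Python's standard library (host and index name are placeholders for my setup):

import urllib.request

HOST = "http://localhost:9200"  # placeholder; point at one of your nodes

def post(path):
    # The _close and _open index APIs are plain POST requests with empty bodies.
    req = urllib.request.Request(HOST + path, data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        print(path, "->", resp.read().decode())

post("/topbeat/_close")  # close the index
post("/topbeat/_open")   # reopen it; its shards then go through recovery again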

All of this started happening when I tried to close a couple of old indices.

After many hours of struggle, I finally deleted the data folder. Now everything is running fine.

If you are interested, I can send you the logs. They should reveal an issue where bulk-closing indices puts the cluster into an irrecoverable state.


(Mark Walkom) #4

If you kept track of everything, it may be worth raising an issue on GitHub.

