Hi, I recently ran into some very strange translog behavior in my cluster:
I noticed that one node had been running a task like this for 3 days:
"node" : "PAHpy641TPqLYhPOKPnAvQ",
"id" : 1336600606,
"type" : "transport",
"action" : "internal:indices/flush/synced/pre",
"start_time_in_millis" : 1606122749080,
"running_time_in_nanos" : 251540520086768,
"cancellable" : false,
"headers" : { }
So I tried to restart the node with service elasticsearch force-reload to fix the issue, but that command got stuck as well. Then I killed all ES processes with kill -9 and started the node again with service elasticsearch start.
The node started successfully and all shards initialized except one. I inspected that shard and got the following recovery information:
"id": 38,
"type": "PEER",
"stage": "TRANSLOG",
"primary": false,
"start_time": "2020-11-26T07:35:27.818Z",
"start_time_in_millis": 1606376127818,
"total_time": "17.6m",
"total_time_in_millis": 1060082,
"source": { -
"id": "l7cKKJOpTTy11odnaBJy-g",
"host": "10.101.0.112",
"transport_address": "10.101.0.112:9300",
"ip": "10.101.0.112",
"name": "nodename1"
},
"target": { -
"id": "PAHpy641TPqLYhPOKPnAvQ",
"host": "10.101.0.111",
"transport_address": "10.101.0.111:9300",
"ip": "10.101.0.111",
"name": "nodename2"
},
"index": { -
"size": { -
"total": "0b",
"total_in_bytes": 0,
"reused": "0b",
"reused_in_bytes": 0,
"recovered": "0b",
"recovered_in_bytes": 0,
"percent": "0.0%"
},
"files": { -
"total": 0,
"reused": 0,
"recovered": 0,
"percent": "0.0%"
},
"total_time": "5ms",
"total_time_in_millis": 5,
"source_throttle_time": "-1",
"source_throttle_time_in_millis": 0,
"target_throttle_time": "-1",
"target_throttle_time_in_millis": 0
},
"translog": { -
"recovered": 5967872,
"total": -1,
"percent": "-1.0%",
"total_on_start": -1,
"total_time": "17.6m",
"total_time_in_millis": 1060074
},
"verify_index": { -
"check_index_time": "0s",
"check_index_time_in_millis": 0,
"total_time": "0s",
"total_time_in_millis": 0
}
}
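This recovery output came from the index recovery API, something along the lines of:

GET indexname/_recovery?human

filtered down to the one shard that was stuck.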
The "percent": "-1.0%" value looked very strange to me, so I checked this shard on the server and found this in its translog directory:
-rw-r--r-- 1 elasticsearch elasticsearch 88 Nov 23 09:07 translog-92.ckp
-rw-r--r-- 1 elasticsearch elasticsearch 11G Nov 23 09:07 translog-92.tlog
-rw-r--r-- 1 elasticsearch elasticsearch 88 Nov 26 07:35 translog-93.ckp
-rw-r--r-- 1 elasticsearch elasticsearch 116G Nov 25 14:26 translog-93.tlog
-rw-r--r-- 1 elasticsearch elasticsearch 55 Nov 26 07:35 translog-94.tlog
-rw-r--r-- 1 elasticsearch elasticsearch 88 Nov 26 07:35 translog.ckp
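The listing above is from the shard's translog directory under the data path (roughly .../nodes/0/indices/Enc5JIwLToqJUzq3HnSjoA/38/translog/ in my case). The same numbers should also be visible from the API side via the translog stats, e.g.:

GET indexname/_stats/translog?human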
That is 116G of translog on an index with a 256mb flush threshold. Here are the index settings, by the way:
{
"indexname": {
"settings": {
"index": {
"refresh_interval": "3000s",
"number_of_shards": "60",
"translog": { -
"flush_threshold_size": "256mb"
},
"provided_name": "indexname",
"creation_date": "1602579459530",
"unassigned": { -
"node_left": { -
"delayed_timeout": "10m"
}
},
"analysis": { -
"analyzer": { -
"lowercase_analyzer": { -
"filter": [ -
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
},
"reverse_analyzer": { -
"filter": [ -
"reverse"
],
"type": "custom",
"tokenizer": "keyword"
},
"reverse_lowercase_analyzer": { -
"filter": [ -
"reverse",
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
}
}
},
"number_of_replicas": "1",
"uuid": "Enc5JIwLToqJUzq3HnSjoA",
"version": { -
"created": "7060299"
}
}
}
}
}
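The settings above are from GET indexname/_settings; to rule out other overrides, the effective values can also be checked with something like GET indexname/_settings?include_defaults=true&flat_settings=true.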
I can't fix this shard by relocating it or in any other way, so my cluster will stay in yellow state for as long as this broken index exists. We use large, long-lived indices, so this translog issue is critical for us.
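For completeness, the kinds of things I tried (without success) were along these lines:

GET _cluster/allocation/explain
POST _cluster/reroute?retry_failed=true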
How can I avoid or fix this strange behavior? Or has this bug maybe been fixed in later versions? Or am I doing something wrong?
Thank you in advance.