Hi, I recently ran into some very strange translog behavior in my cluster:
I noticed that one node had been running a task like this for 3 days:
"node" : "PAHpy641TPqLYhPOKPnAvQ",
"id" : 1336600606,
"type" : "transport",
"action" : "internal:indices/flush/synced/pre",
"start_time_in_millis" : 1606122749080,
"running_time_in_nanos" : 251540520086768,
"cancellable" : false,
"headers" : { }
So I tried to reload this node with
service elasticsearch force-reload
to fix the issue, but that command got stuck too. Then I killed all ES processes with kill -9 and started the node again with service elasticsearch start.
The node started successfully and all shards were initialized except one. I inspected this shard and got the following information:
"id": 38,
"type": "PEER",
"stage": "TRANSLOG",
"primary": false,
"start_time": "2020-11-26T07:35:27.818Z",
"start_time_in_millis": 1606376127818,
"total_time": "17.6m",
"total_time_in_millis": 1060082,
"source": { -
"id": "l7cKKJOpTTy11odnaBJy-g",
"host": "10.101.0.112",
"transport_address": "10.101.0.112:9300",
"ip": "10.101.0.112",
"name": "nodename1"
},
"target": { -
"id": "PAHpy641TPqLYhPOKPnAvQ",
"host": "10.101.0.111",
"transport_address": "10.101.0.111:9300",
"ip": "10.101.0.111",
"name": "nodename2"
},
"index": { -
"size": { -
"total": "0b",
"total_in_bytes": 0,
"reused": "0b",
"reused_in_bytes": 0,
"recovered": "0b",
"recovered_in_bytes": 0,
"percent": "0.0%"
},
"files": { -
"total": 0,
"reused": 0,
"recovered": 0,
"percent": "0.0%"
},
"total_time": "5ms",
"total_time_in_millis": 5,
"source_throttle_time": "-1",
"source_throttle_time_in_millis": 0,
"target_throttle_time": "-1",
"target_throttle_time_in_millis": 0
},
"translog": { -
"recovered": 5967872,
"total": -1,
"percent": "-1.0%",
"total_on_start": -1,
"total_time": "17.6m",
"total_time_in_millis": 1060074
},
"verify_index": { -
"check_index_time": "0s",
"check_index_time_in_millis": 0,
"total_time": "0s",
"total_time_in_millis": 0
}
}
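That snapshot is from the index recovery API, roughly like this (host and index name are placeholders):
# show only ongoing recoveries for the index, in human-readable form
curl -s 'http://localhost:9200/indexname/_recovery?active_only=true&human&pretty'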
The "percent": "-1.0%" value looked very strange to me, so I then checked this shard on the server and found this:
-rw-r--r-- 1 elasticsearch elasticsearch 88 Nov 23 09:07 translog-92.ckp
-rw-r--r-- 1 elasticsearch elasticsearch 11G Nov 23 09:07 translog-92.tlog
-rw-r--r-- 1 elasticsearch elasticsearch 88 Nov 26 07:35 translog-93.ckp
-rw-r--r-- 1 elasticsearch elasticsearch 116G Nov 25 14:26 translog-93.tlog
-rw-r--r-- 1 elasticsearch elasticsearch 55 Nov 26 07:35 translog-94.tlog
-rw-r--r-- 1 elasticsearch elasticsearch 88 Nov 26 07:35 translog.ckp
That is 116G of translog on an index with a 256mb flush threshold. Here are the index settings, by the way:
{
"indexname": {
"settings": {
"index": {
"refresh_interval": "3000s",
"number_of_shards": "60",
"translog": { -
"flush_threshold_size": "256mb"
},
"provided_name": "indexname",
"creation_date": "1602579459530",
"unassigned": { -
"node_left": { -
"delayed_timeout": "10m"
}
},
"analysis": { -
"analyzer": { -
"lowercase_analyzer": { -
"filter": [ -
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
},
"reverse_analyzer": { -
"filter": [ -
"reverse"
],
"type": "custom",
"tokenizer": "keyword"
},
"reverse_lowercase_analyzer": { -
"filter": [ -
"reverse",
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
}
}
},
"number_of_replicas": "1",
"uuid": "Enc5JIwLToqJUzq3HnSjoA",
"version": { -
"created": "7060299"
}
}
}
}
}
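For comparison, the translog size that Elasticsearch itself reports for this index can be checked against those on-disk files and the 256mb threshold with the stats and settings APIs, for example (host and index name are placeholders):
# translog stats as Elasticsearch reports them
curl -s 'http://localhost:9200/indexname/_stats/translog?human&pretty'
# effective index settings, including defaults such as translog retention
curl -s 'http://localhost:9200/indexname/_settings?include_defaults=true&pretty'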
I can't fix this shard by relocating it or in any other way, so my cluster stays in yellow state for as long as this broken index exists. We use long-lived and large indices, so this translog issue is very critical for us.
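For reference, by relocating I mean the usual allocation tooling, roughly (a sketch, assuming the same host, index name and shard id as above):
# explain why the replica of shard 38 stays unassigned / stuck in recovery
curl -s -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/allocation/explain?pretty' -d '{"index":"indexname","shard":38,"primary":false}'
# retry allocations that previously failed
curl -s -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'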
How can I work around or fix this strange behaviour? Or was this bug perhaps fixed in later versions? Or am I doing something wrong?
Thank you in advance.