I'm new to Elasticsearch. I have a cluster of 2 data+master nodes and 1 master node, running in Docker containers. Yesterday one data node was killed by the OOM killer; it restarted and started the recovery process. This recovery has not finished yet (70 GB index, 3 shards). I found that it is recovering data from the translog. At the same time, the Elasticsearch process is doing huge IO reads (about 1 GB/s).
I'm using elasticsearch-6.2.4.
How can I fix it?
GET /logs-space-2018w16/_recovery?active_only
"logs-space-2018w16": {
  "shards": [
    {
      "id": 1,
      "type": "EXISTING_STORE",
      "stage": "TRANSLOG",
      "primary": true,
      "start_time_in_millis": 1524230896638,
      "total_time_in_millis": 80016368,
      "source": {},
      "target": {
        "id": "5OtMMFjiQOaipOeRc5SHZA",
        "host": "es-iva-1-common-storage-crm-v2.es-iva-1.stable.logs.company.net",
        "transport_address": "[2a02:6b8:c0c:711d:0:1340:9d0c:62f8]:9300",
        "ip": "2a02:6b8:c0c:711d:0:1340:9d0c:62f8",
        "name": "crmlogs-es-iva-1-common-storage-crm-v2"
      },
      "index": {
        "size": {
          "total_in_bytes": 23945309257,
          "reused_in_bytes": 23945309257,
          "recovered_in_bytes": 0,
          "percent": "100.0%"
        },
        "files": {
          "total": 918,
          "reused": 918,
          "recovered": 0,
          "percent": "100.0%"
        },
        "total_time_in_millis": 145,
        "source_throttle_time_in_millis": 0,
        "target_throttle_time_in_millis": 0
      },
      "translog": {
        "recovered": 1108850,
        "total": 7321709,
        "percent": "15.1%",
        "total_on_start": 7321709,
        "total_time_in_millis": 80016192
      },
      "verify_index": {
        "check_index_time_in_millis": 0,
        "total_time_in_millis": 0
      }
    },
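To put the translog numbers above in perspective, here is a rough sketch that estimates the replay rate and the time remaining for this shard. It assumes the replay rate stays constant, which may not hold in practice; the function name `translog_eta` is just an illustrative helper, not part of any Elasticsearch client.

```python
def translog_eta(recovered_ops, total_ops, elapsed_ms):
    """Return (ops_per_second, remaining_seconds).

    Assumption: translog replay continues at the average rate seen so far.
    """
    if recovered_ops <= 0 or elapsed_ms <= 0:
        raise ValueError("need a positive operation count and elapsed time")
    rate = recovered_ops / (elapsed_ms / 1000.0)
    remaining_seconds = (total_ops - recovered_ops) / rate
    return rate, remaining_seconds


# Values copied from the "translog" section of the _recovery output above.
rate, remaining = translog_eta(
    recovered_ops=1108850,
    total_ops=7321709,
    elapsed_ms=80016192,
)
print(f"{rate:.1f} ops/s, roughly {remaining / 86400:.1f} days left")
```

At this pace the shard is replaying on the order of a dozen operations per second, so finishing the remaining translog would take days rather than hours.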
root@9d0c62f86efc:/# pidstat -dl 20
Linux 4.4.88-42 (9d0c62f86efc.net) 04/21/18 _x86_64_ (32 CPU)
11:43:33 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
11:43:53 1000 29 1031282.20 5.60 0.00 0 /usr/bin/java -Xms14500m -Xmx14500m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupanc
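For scale, a back-of-the-envelope sketch comparing this read rate with the size of the shard being recovered. It assumes the rate from this single 20-second sample held for the whole recovery (which I have not verified) and that pidstat's kB means 1024 bytes.

```python
KIB = 1024  # pidstat reports kB_rd/s in units of 1024 bytes

read_kb_per_s = 1031282.20   # kB_rd/s from the pidstat sample
recovery_ms = 80016368       # total_time_in_millis from _recovery
shard_bytes = 23945309257    # index.size.total_in_bytes from _recovery

# Assumption: the sampled read rate is representative of the whole recovery.
read_bytes_per_s = read_kb_per_s * KIB
total_read_bytes = read_bytes_per_s * (recovery_ms / 1000.0)
ratio = total_read_bytes / shard_bytes

print(f"read rate: {read_bytes_per_s / 1e9:.2f} GB/s")
print(f"estimated total read: {total_read_bytes / 1e12:.1f} TB")
print(f"that is about {ratio:.0f}x the size of the shard")
```

If the assumption is anywhere near right, the process has read thousands of times more data than the shard contains, which is why the IO looks wrong to me.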
I also found an interesting line in the logs from before the incident:
[2018-04-20T06:17:41,214][DEBUG][o.e.i.e.InternalEngine$EngineMergeScheduler] [crmlogs-es-iva-1-common-storage-crm-v2] [logs-space-2018w16][0] merge segment [_3gn2] done: took [17.6h], [12.9 MB], [44,347 docs], [0s stopped], [0s throttled], [19.1 MB written], [Infinity MB/sec throttle]
UPD. After one day, recovery of the current index is still at 15% =(
"translog": {
  "recovered": 1099883,
  "total": 7321709,
  "percent": "15.0%",
  "total_on_start": 7321709,
  "total_time_in_millis": 57291164
},