Index recovery from translog takes forever

I'm new to Elasticsearch. I have a cluster of 2 data+master nodes and 1 master node, running in Docker containers. Yesterday one data node was killed by the OOM killer, restarted, and started recovering. That recovery still has not finished (a 70 GB index with 3 shards). I found that it is recovering data from the translog, and while it does so the Elasticsearch process generates huge IO reads (~1 GB/s).
I'm using Elasticsearch 6.2.4.
How can I fix it?

GET /logs-space-2018w16/_recovery?active_only
"logs-space-2018w16": {
"shards": [
  {
    "id": 1,
    "type": "EXISTING_STORE",
    "stage": "TRANSLOG",
    "primary": true,
    "start_time_in_millis": 1524230896638,
    "total_time_in_millis": 80016368,
    "source": {},
    "target": {
      "id": "5OtMMFjiQOaipOeRc5SHZA",
      "host": "es-iva-1-common-storage-crm-v2.es-iva-1.stable.logs.company.net",
      "transport_address": "[2a02:6b8:c0c:711d:0:1340:9d0c:62f8]:9300",
      "ip": "2a02:6b8:c0c:711d:0:1340:9d0c:62f8",
      "name": "crmlogs-es-iva-1-common-storage-crm-v2"
    },
    "index": {
      "size": {
        "total_in_bytes": 23945309257,
        "reused_in_bytes": 23945309257,
        "recovered_in_bytes": 0,
        "percent": "100.0%"
      },
      "files": {
        "total": 918,
        "reused": 918,
        "recovered": 0,
        "percent": "100.0%"
      },
      "total_time_in_millis": 145,
      "source_throttle_time_in_millis": 0,
      "target_throttle_time_in_millis": 0
    },
    "translog": {
      "recovered": 1108850,
      "total": 7321709,
      "percent": "15.1%",
      "total_on_start": 7321709,
      "total_time_in_millis": 80016192
    },
    "verify_index": {
      "check_index_time_in_millis": 0,
      "total_time_in_millis": 0
    }
  },
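
If I'm reading these numbers right, this shard has replayed about 1.1M of 7.3M translog operations in roughly 22 hours (80,016,192 ms), i.e. around 14 operations per second. At that rate the remaining ~6.2M operations would take about five more days.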


root@9d0c62f86efc:/# pidstat -dl 20
Linux 4.4.88-42 (9d0c62f86efc.net)  04/21/18        _x86_64_        (32 CPU)

11:43:33      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
11:43:53     1000        29 1031282.20      5.60      0.00       0  /usr/bin/java -Xms14500m -Xmx14500m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupanc

Furthermore, I found an interesting line in the logs from shortly before this incident:
[2018-04-20T06:17:41,214][DEBUG][o.e.i.e.InternalEngine$EngineMergeScheduler] [crmlogs-es-iva-1-common-storage-crm-v2] [logs-space-2018w16][0] merge segment [_3gn2] done: took [17.6h], [12.9 MB], [44,347 docs], [0s stopped], [0s throttled], [19.1 MB written], [Infinity MB/sec throttle]

UPD: after one day the translog recovery is still at 15% =(
"translog": {
"recovered": 1099883,
"total": 7321709,
"percent": "15.0%",
"total_on_start": 7321709,
"total_time_in_millis": 57291164
},
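
What also confuses me is that "recovered" and "total_time_in_millis" here are lower than the day before, as if the replay had restarted from the beginning. I also want to see how much translog is actually sitting on disk for this index; I assume the translog section of the indices stats API is the right place to look:

GET /logs-space-2018w16/_stats/translog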

I'm trying to find more diagnostic information, for example GET _nodes/hot_threads:
::: {crmlogs-es-iva-1-common-storage-crm-v2}{5OtMMFjiQOaipOeRc5SHZA}{yuxAXKf8QsqKPxFem2KxoQ}{es-iva-1-common-storage-crm-v2.es-iva-1.stable.logs.company.net}{[2a02:6b8:c0c:711d:0:1340:9d0c:62f8]:9300}
Hot threads at 2018-04-22T12:15:44.093Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

10.1% (50.4ms out of 500ms) cpu usage by thread 'elasticsearch[crmlogs-es-iva-1-common-storage-crm-v2][generic][T#1]'
 10/10 snapshots sharing following 33 elements
   java.nio.Bits.copyToArray(Bits.java:836)
   java.nio.DirectByteBuffer.get(DirectByteBuffer.java:280)
   org.apache.lucene.store.ByteBufferGuard.getBytes(ByteBufferGuard.java:93)
   org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:89)
   org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock(SegmentTermsEnumFrame.java:173)
   org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum.java:507)
   org.elasticsearch.common.lucene.uid.PerThreadIDVersionAndSeqNoLookup.getDocID(PerThreadIDVersionAndSeqNoLookup.java:114)
   org.elasticsearch.common.lucene.uid.PerThreadIDVersionAndSeqNoLookup.lookupSeqNo(PerThreadIDVersionAndSeqNoLookup.java:134)
   org.elasticsearch.common.lucene.uid.VersionsAndSeqNoResolver.loadDocIdAndSeqNo(VersionsAndSeqNoResolver.java:159)
   org.elasticsearch.index.engine.InternalEngine.compareOpToLuceneDocBasedOnSeqNo(InternalEngine.java:680)
   org.elasticsearch.index.engine.InternalEngine.planIndexingAsNonPrimary(InternalEngine.java:944)
   org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:871)
   org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:738)
   org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:707)
   org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1245)
   org.elasticsearch.index.shard.IndexShard.runTranslogRecovery(IndexShard.java:1274)
   org.elasticsearch.index.shard.IndexShard$$Lambda$1726/313643083.run(Unknown Source)
   org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:471)
   org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:393)
   org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:99)
   org.elasticsearch.index.shard.IndexShard.openIndexAndRecoveryFromTranslog(IndexShard.java:1332)
   org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:404)
   org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:94)
   org.elasticsearch.index.shard.StoreRecovery$$Lambda$1724/2015389504.run(Unknown Source)
   org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:300)
   org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:92)
   org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1607)
   org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$6(IndexShard.java:2039)
   org.elasticsearch.index.shard.IndexShard$$Lambda$1723/918485866.run(Unknown Source)
   org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   java.lang.Thread.run(Thread.java:748)

8.8% (43.9ms out of 500ms) cpu usage by thread 'elasticsearch[crmlogs-es-iva-1-common-storage-crm-v2][generic][T#3]'
 10/10 snapshots sharing following 33 elements
   java.nio.Bits.copyToArray(Bits.java:836)
   java.nio.DirectByteBuffer.get(DirectByteBuffer.java:280)
   org.apache.lucene.store.ByteBufferGuard.getBytes(ByteBufferGuard.java:93)
   org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:89)
   ...
   java.lang.Thread.run(Thread.java:748)

8.6% (42.9ms out of 500ms) cpu usage by thread 'elasticsearch[crmlogs-es-iva-1-common-storage-crm-v2][generic][T#5]'
 10/10 snapshots sharing following 33 elements
   java.nio.Bits.copyToArray(Bits.java:836)
   java.nio.DirectByteBuffer.get(DirectByteBuffer.java:280)
   ...
   java.lang.Thread.run(Thread.java:748)

::: {crmlogs-es-master-1}{n7P_pGWVSyyBdH9E_tBVkg}{dO0eRaTqQ4O5rNENRhnB0g}{es-master-1.es-master.stable.logs.company.net}{[2a02:6b8:c12:2008:0:1340:bd63:e2d3]:9300}
  Hot threads at 2018-04-22T12:15:44.088Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {crmlogs-es-sas-1-common-storage-crm-v1}{bSHPZ9QjQE2tUzF3t7-48g}{3v2UTdW6QwyDjAnaDDmyTQ}{es-sas-1-common-storage-crm-v1.es-sas-1.stable.logs.company.net}{[2a02:6b8:c08:b113:0:1340:bc35:602b]:9300}
  Hot threads at 2018-04-22T12:15:44.094Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
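
In case it's relevant, these are the translog-related index settings I'm looking at (all left at their defaults as far as I know, and I'm not sure which of them actually controls how much translog has to be replayed after a restart):

GET /logs-space-2018w16/_settings?include_defaults=true

In the output I'm mostly checking index.translog.flush_threshold_size, index.translog.retention.size, index.translog.retention.age and index.translog.durability.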
