ES 2.1 shards stuck in translog recovery

We are running an ES 2.1 single-node cluster on Windows 7 / NTFS. After ES crashed due to a faulty disk (a replacement is under way), the indices being written to at the time of the crash are stuck in the translog recovery phase. I will use the .marvel index as an example here.

Output from curl -s "http://$ES_HOST:9200/_cat/recovery?v=h"

index                 shard time    type  stage    source_host target_host repository snapshot files files_percent bytes bytes_percent total_files total_bytes translog translog_percent total_translog
.marvel-es-2015.11.26 0     1209658 store translog 10.2.159.84 10.2.159.84 n/a        n/a      0     100.0%        0     100.0%        1           130         32660    -1.0%            -1

Output from curl -s "http://$ES_HOST:9200/.marvel-es-2015.11.26/_recovery?v=h"

  ".marvel-es-2015.11.26": {
    "shards": [
      {
        "verify_index": {
          "total_time_in_millis": 0,
          "check_index_time_in_millis": 0
        },
        "translog": {
          "total_time_in_millis": 3010971,
          "total_on_start": -1,
          "percent": "-1.0%",
          "total": -1,
          "recovered": 79466
        },
        "index": {
          "target_throttle_time_in_millis": 0,
          "source_throttle_time_in_millis": 0,
          "total_time_in_millis": 24,
          "files": {
            "percent": "100.0%",
            "recovered": 0,
            "reused": 1,
            "total": 1
          },
          "size": {
            "percent": "100.0%",
            "recovered_in_bytes": 0,
            "reused_in_bytes": 130,
            "total_in_bytes": 130
          }
        },
        "id": 0,
        "type": "STORE",
        "stage": "TRANSLOG",
        "primary": true,
        "start_time_in_millis": 1448608132857,
        "total_time_in_millis": 3010997,
        "source": {
          "name": "ELK0",
          "ip": "10.2.x.x",
          "transport_address": "10.2.x.x:9300",
          "host": "10.2.x.x",
          "id": "zuwWitsGS0uT9DVm9R4Fdw"
        },
        "target": {
          "name": "ELK0",
          "ip": "10.2.x.x",
          "transport_address": "10.2.x.x:9300",
          "host": "10.2.x.x",
          "id": "zuwWitsGS0uT9DVm9R4Fdw"
        }
      }
    ]
  }
}

The machine is now at 100% CPU usage, and no progress is seen in the index recovery except for the "time" and "translog" numbers rising slowly.

The translog folder looks like this:

 Directory of F:\XXX\nodes\0\indices\.marvel-es-2015.11.26\0\translog

27.11.2015  08:08    <DIR>          .
27.11.2015  08:08    <DIR>          ..
26.11.2015  19:23                20 translog-1.ckp
26.11.2015  19:23        80.677.779 translog-1.tlog
27.11.2015  08:49        37.717.508 translog-2.tlog
27.11.2015  08:49                20 translog.ckp
               4 File(s),    118.395.327 bytes

The translog-2.tlog file seems to be growing slowly.

While we believe the root cause has already been identified as the faulty disk, we still think index recovery is far too slow, especially considering that the .marvel index has 1 file and 130 bytes of data according to the API.

Well, at least the .marvel index translog recovery came to an end, although it took nearly 1.5 hours. :frowning:

.marvel-es-2015.11.26 0     5157291 store done     10.2.159.84 10.2.159.84 n/a        n/a      0     100.0%        0     100.0%        1           130         137838   -1.0%            -1

We've got similar issues on Dockerized Elasticsearch 2.1.0, where translog recovery pegs all CPUs for hours.

Typical recovery info looks like this:

index                                              shard time    type    stage    source_host   target_host   repository snapshot files files_percent bytes bytes_percent total_files total_bytes translog translog_percent total_translog 
web-request-log-2015.09.21               0     2184250 store   translog 192.168.22.35 192.168.22.35 n/a        n/a      0     100.0%        0     100.0%        1           130         132688   -1.0%            -1             

Further analysis shows that the translog recovery is terribly slow:


 "index" : {
        "size" : {
          "total_in_bytes" : 130,
          "reused_in_bytes" : 130,
          "recovered_in_bytes" : 0,
          "percent" : "100.0%"
        },
        "files" : {
          "total" : 1,
          "reused" : 1,
          "recovered" : 0,
          "percent" : "100.0%"
        },
        "total_time_in_millis" : 34,
        "source_throttle_time_in_millis" : 0,
        "target_throttle_time_in_millis" : 0
      },
      "translog" : {
        "recovered" : 24482,
        "total" : -1,
        "percent" : "-1.0%",
        "total_on_start" : -1,
        "total_time_in_millis" : 479400
      },

The recovered count increases at a pace of about 100 operations per second, which is rather slow for an 8-core machine. It appears that each translog replay pegs the CPU at 100% until completed.

I have also encountered the same problem, with extremely long recovery times (roughly 60 translog operations per second):

    "translog": {
      "recovered": 100076,
      "total": -1,
      "percent": "-1.0%",
      "total_on_start": -1,
      "total_time_in_millis": 1634269
    },

After a node crashes (or the disk is full), Lucene rolls back to its last commit - the last point where it is guaranteed to be consistent on disk. This corresponds to the last flush in ES. To compensate, Elasticsearch then replays all operations in the translog to bring Lucene up to date. This is done on a single thread, which is why you are seeing 100% CPU usage. The question in your case is why it took so long. It seems you have ~100K operations in your translog, which is some 80 MB in size. If this takes too long, you can increase the frequency of flushes (to reduce the translog file size). See https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html#_flush_settings
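For example, a lower flush threshold could be applied to an index with the update settings API. This is only a sketch: the setting name index.translog.flush_threshold_size comes from the 2.x translog module docs linked above, and the 64mb value is an arbitrary illustration, so check both against your version before relying on them.

curl -XPUT "http://$ES_HOST:9200/.marvel-es-2015.11.26/_settings" -d '{
  "index.translog.flush_threshold_size": "64mb"
}'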

Thanks @bleskes for the explanation.

However, I still think the recovery operation is far too slow, since a normal translog flush on a translog file of > 200 MB finished within seconds on the same machine (I triggered it via the index API / flush command).

I had a similar issue where my shards would take forever to relocate when the cluster was grown or shrunk. Flushing all shards via /_flush solved the problem for me.

Doesn't it make sense to flush the shard before trying to replicate it somewhere? I had shards with translogs over 1 GB.
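For reference, the flush I mean is just the standard flush API; the host and index name below are placeholders for your own setup:

curl -XPOST "http://localhost:9200/_flush"
curl -XPOST "http://localhost:9200/my-index/_flush"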

Flush doesn't replay the translog but rather commits any pending changes to Lucene (they are already indexed, but not fsynced). Once this is done, the translog can be trimmed because Lucene persists the changes. It's not indicative of relocation performance.

@emptyemail (if this is your handle :slight_smile: ) - the flush operation trims the translog, which means the replay is faster.

Note that as of 1.6.0 we auto-flush (actually sync flush) after 5m of inactivity.
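A synced flush can also be triggered manually through the synced flush API, available from 1.6.0 on (the host here is a placeholder):

curl -XPOST "http://localhost:9200/_flush/synced"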

@bleskes, correct, but once you issue _flush the translog is cleared. Also, I am not sure auto-flush is happening, and it doesn't look like the default cap is working either. I found shards that had a 1.5 GB translog; now, every time I manually do something that requires re-balancing, I have to issue a _flush first.

@emptyemail which version of ES are you using? Before 2.0 we would prevent flushing on the primary shard if there were ongoing recoveries (but flush as soon as they are done). 2.0 decouples the two and allows the engine to flush but still keeps the translog files required for the recovery.

@bleskes, this was with version 2.0; I'm not sure about 2.1, which I am running right now. I've been flushing prior to doing any serious rebalancing.

I am seeing something similar. I was doing a rolling restart on an ES 2.1.1 cluster and one of the shards was taking a lot of time to initialize. I then did a cat on the recovery API and I saw that the translog phase on that shard was extremely slow.

$ curl -s -XGET 'els02ml.example.com:9200/_cat/recovery?v' | grep -v done
index                        shard time    type       stage    source_host target_host repository snapshot files files_percent bytes       bytes_percent total_files total_bytes translog translog_percent total_translog
locations_fr.2016_22         0     5324454 replica    translog 10.11.17.39 10.11.17.37 n/a        n/a      127   100.0%        2260844482  100.0%        151         4204901061  779035   91.7%            850001

I then did a query for hot threads and this is the hottest one by far:

::: {els02ml}{bZ-iuElsT7SH-6OOcFK4Fg}{10.11.17.37}{10.11.17.37:9300}{index_group=default, master=true}
   Hot threads at 2016-06-10T15:38:25.271Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   91.0% (455.1ms out of 500ms) cpu usage by thread 'elasticsearch[els02ml][generic][T#7]'
     2/10 snapshots sharing following 21 elements
       org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:196)
       org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:107)
       org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:112)
       org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:422)
       org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:503)
       org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:377)
       org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:467)
       org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
       org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1254)
       org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:539)
       org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:465)
       org.elasticsearch.index.shard.TranslogRecoveryPerformer.performRecoveryOperation(TranslogRecoveryPerformer.java:169)
       org.elasticsearch.index.shard.TranslogRecoveryPerformer.performBatchRecovery(TranslogRecoveryPerformer.java:87)
       org.elasticsearch.index.shard.IndexShard.performBatchRecovery(IndexShard.java:876)
       org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:304)
       org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:293)
       org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:745)
     3/10 snapshots sharing following 18 elements
       org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:422)
       org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:503)
       org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:377)
       org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:467)
       org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
       org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1254)
       org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:539)
       org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:465)
       org.elasticsearch.index.shard.TranslogRecoveryPerformer.performRecoveryOperation(TranslogRecoveryPerformer.java:169)
       org.elasticsearch.index.shard.TranslogRecoveryPerformer.performBatchRecovery(TranslogRecoveryPerformer.java:87)
       org.elasticsearch.index.shard.IndexShard.performBatchRecovery(IndexShard.java:876)
       org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:304)
       org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:293) 
...
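For reference, the hot threads dump above was obtained from the nodes hot threads API; a typical call (host as in the earlier curl, adjust for your environment) looks like:

curl -s -XGET 'els02ml.example.com:9200/_nodes/hot_threads'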

Note: I am now seeing the same behaviour on another node, on another shard of the same index.

This is fixed in 2.1.2 / 2.2.0 - https://github.com/elastic/elasticsearch/issues/16206