Elasticsearch 7.6.2 translog overflow issue

Hi, recently I ran into some very strange translog behaviour in my cluster:
I noticed that one node had been running a task like this for 3 days:

      "node" : "PAHpy641TPqLYhPOKPnAvQ",
      "id" : 1336600606,
      "type" : "transport",
      "action" : "internal:indices/flush/synced/pre",
      "start_time_in_millis" : 1606122749080,
      "running_time_in_nanos" : 251540520086768,
      "cancellable" : false,
      "headers" : { }

So I tried to reload this node with
service elasticsearch force-reload
to fix this issue, but this command got stuck too. Then I ran kill -9 on all the ES processes and started the node again with service elasticsearch start.
The node started successfully and all shards initialized except one. I tried to inspect this shard's recovery status and got the following:
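
For reference, this per-shard recovery output comes from something like the following call (index name assumed, ?human for the readable durations):

GET /indexname/_recovery?human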

  "id": 38,
  "type": "PEER",
  "stage": "TRANSLOG",
  "primary": false,
  "start_time": "2020-11-26T07:35:27.818Z",
  "start_time_in_millis": 1606376127818,
  "total_time": "17.6m",
  "total_time_in_millis": 1060082,
  "source": { - 
    "id": "l7cKKJOpTTy11odnaBJy-g",
    "host": "10.101.0.112",
    "transport_address": "10.101.0.112:9300",
    "ip": "10.101.0.112",
    "name": "nodename1"
  },
  "target": { - 
    "id": "PAHpy641TPqLYhPOKPnAvQ",
    "host": "10.101.0.111",
    "transport_address": "10.101.0.111:9300",
    "ip": "10.101.0.111",
    "name": "nodename2"
  },
  "index": { - 
    "size": { - 
      "total": "0b",
      "total_in_bytes": 0,
      "reused": "0b",
      "reused_in_bytes": 0,
      "recovered": "0b",
      "recovered_in_bytes": 0,
      "percent": "0.0%"
    },
    "files": { - 
      "total": 0,
      "reused": 0,
      "recovered": 0,
      "percent": "0.0%"
    },
    "total_time": "5ms",
    "total_time_in_millis": 5,
    "source_throttle_time": "-1",
    "source_throttle_time_in_millis": 0,
    "target_throttle_time": "-1",
    "target_throttle_time_in_millis": 0
  },
  "translog": { - 
    "recovered": 5967872,
    "total": -1,
    "percent": "-1.0%",
    "total_on_start": -1,
    "total_time": "17.6m",
    "total_time_in_millis": 1060074
  },
  "verify_index": { - 
    "check_index_time": "0s",
    "check_index_time_in_millis": 0,
    "total_time": "0s",
    "total_time_in_millis": 0
  }
}

I noticed that "percent": "-1.0%" looked very strange, so I then checked this shard on the server and got this:

-rw-r--r-- 1 elasticsearch elasticsearch   88 Nov 23 09:07 translog-92.ckp
-rw-r--r-- 1 elasticsearch elasticsearch  11G Nov 23 09:07 translog-92.tlog
-rw-r--r-- 1 elasticsearch elasticsearch   88 Nov 26 07:35 translog-93.ckp
-rw-r--r-- 1 elasticsearch elasticsearch 116G Nov 25 14:26 translog-93.tlog
-rw-r--r-- 1 elasticsearch elasticsearch   55 Nov 26 07:35 translog-94.tlog
-rw-r--r-- 1 elasticsearch elasticsearch   88 Nov 26 07:35 translog.ckp

That is 116G of translog with a 256mb flush threshold on the index. Here are the index settings, by the way:

{
  "indexname": {
    "settings": {
      "index": {
        "refresh_interval": "3000s",
        "number_of_shards": "60",
        "translog": { - 
          "flush_threshold_size": "256mb"
        },
        "provided_name": "indexname",
        "creation_date": "1602579459530",
        "unassigned": { - 
          "node_left": { - 
            "delayed_timeout": "10m"
          }
        },
        "analysis": { - 
          "analyzer": { - 
            "lowercase_analyzer": { - 
              "filter": [ - 
                "lowercase"
              ],
              "type": "custom",
              "tokenizer": "keyword"
            },
            "reverse_analyzer": { - 
              "filter": [ - 
                "reverse"
              ],
              "type": "custom",
              "tokenizer": "keyword"
            },
            "reverse_lowercase_analyzer": { - 
              "filter": [ - 
                "reverse",
                "lowercase"
              ],
              "type": "custom",
              "tokenizer": "keyword"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "Enc5JIwLToqJUzq3HnSjoA",
        "version": { - 
          "created": "7060299"
        }
      }
    }
  }
}
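
For reference, the 256mb limit above is the dynamic index.translog.flush_threshold_size setting; a value like that can be applied with something along these lines (index name assumed):

PUT /indexname/_settings
{
  "index.translog.flush_threshold_size": "256mb"
}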

I can't fix this shard by relocating it or in any other way, so my cluster will stay in a yellow state for as long as this broken index is in it. We use long-lived, large indices, so this translog issue is very critical for us.
How can I prevent or fix this strange behaviour? Or has this bug perhaps been fixed in later versions? Or am I doing something wrong?
Thank you in advance.

Can you share the index stats, GET /indexname/_stats?level=shards please?

Sure, here it is:

Thanks. I suspect that whatever was blocking the synced flush caused the excessive translog size. It stopped being necessary to use synced flush in newer versions (see e.g. https://github.com/elastic/elasticsearch/pull/49126, https://github.com/elastic/elasticsearch/pull/50835) and your restart will have lost any context about why the synced flush was not completing.

I think it will help to manually force-flush this index with POST /indexname/_flush?force. If successful, that will reduce the translog to zero on both shard copies, which will avoid the slow recovery.
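
If the flush does succeed, something like this (index name assumed) should confirm the translog has shrunk on both copies:

GET /indexname/_stats/translog?level=shards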

Next time I'll try to gather all the needed information before starting to fix this kind of issue.
In your opinion, what could be the reason for the slow flush? My cluster uses NVMe disks, and this situation involves only 1 index and only 1 shard.
/_flush?force has been stuck for ~30 minutes. Is that ok?
If I change index.translog.durability from request to async, will that help prevent these situations in the future?
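
For reference, the change I have in mind would be something along these lines (index name assumed):

PUT /indexname/_settings
{
  "index": {
    "translog": {
      "durability": "async"
    }
  }
}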

No, that sounds like the same problem. Is it still stuck? If so, can you collect these outputs:

GET _tasks?detailed
GET _nodes/hot_threads?threads=99999
GET /indexname/_stats?level=shards

Still stuck, yes.
Here are the outputs:

Thanks. This looks like a Lucene bug, possibly exacerbated by a bug in how you are generating doc IDs. The fix for the bug is included in 7.7.0 so you can upgrade to resolve it, but you should also check your client for problems.

   100.3% (501.7ms out of 500ms) cpu usage by thread 'elasticsearch[prod-es-data-10][refresh][T#124]'
     10/10 snapshots sharing following 33 elements
       app//org.apache.lucene.index.FrozenBufferedUpdates.applyDocValuesUpdates(FrozenBufferedUpdates.java:568)
       app//org.apache.lucene.index.FrozenBufferedUpdates.applyDocValuesUpdates(FrozenBufferedUpdates.java:451)
       app//org.apache.lucene.index.FrozenBufferedUpdates.apply(FrozenBufferedUpdates.java:421)
       app//org.apache.lucene.index.FrozenBufferedUpdates.forceApply(FrozenBufferedUpdates.java:249)
       app//org.apache.lucene.index.FrozenBufferedUpdates.tryApply(FrozenBufferedUpdates.java:159)
       app//org.apache.lucene.index.IndexWriter.lambda$publishFrozenUpdates$3(IndexWriter.java:2592)
       app//org.apache.lucene.index.IndexWriter$$Lambda$3760/0x00000008016ed040.process(Unknown Source)
       app//org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5116)
       app//org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:507)
       app//org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:297)
       app//org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272)
       app//org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262)
       app//org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
       app//org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:165)
       app//org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:66)
       app//org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:40)
       app//org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       app//org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       app//org.elasticsearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:344)
       app//org.elasticsearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:326)
       app//org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       app//org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:225)
       app//org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1611)

The usual instructions say to flush before upgrading but this won't work for obvious reasons: you'll need to skip this step. Quite possibly it won't be possible to take a snapshot before upgrading either for this reason. If you're doing a rolling upgrade then you should upgrade prod-es-data-10 first.
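
Roughly, each node then goes through the usual rolling-upgrade steps minus the flush; as a sketch (not the full procedure), disable replica allocation before restarting the node:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

Then stop the node, upgrade the package, start it again, and re-enable allocation:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}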

It seems I fixed this shard in a somewhat magical way: I just changed the number of replicas to 0 and then back to 1, and all the translog on this shard is gone.
I also stopped indexing in the cluster and am now preparing for a rolling upgrade to 7.10. I hope it all goes smoothly.
Thanks for the help.
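
For the record, the replica bounce was just something like this (index name assumed):

PUT /indexname/_settings
{ "index": { "number_of_replicas": 0 } }

PUT /indexname/_settings
{ "index": { "number_of_replicas": 1 } }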

That would do it, at the cost of temporarily reducing the redundancy for all of the shards in this index (risky) and needing to copy them all over the network again.
