Elasticesearch 7.6.2 translog overflow issue

rambling · November 26, 2020, 1:54pm

Hi, recently I caught very strange behavior of translog in my cluster:
I noticed that one node was executing task for 3 days like:

      "node" : "PAHpy641TPqLYhPOKPnAvQ",
      "id" : 1336600606,
      "type" : "transport",
      "action" : "internal:indices/flush/synced/pre",
      "start_time_in_millis" : 1606122749080,
      "running_time_in_nanos" : 251540520086768,
      "cancellable" : false,
      "headers" : { }

So I tried to reload this node with
service elasticsearch force-reload
to fix this issue, but this command was stuck too. Then I kill -9 all ES processes and started this node with service elasticsearch start.
Node succesfuly started and all shards were initialized exept one. I've tried to inspect this shard and got next information:

  "id": 38,
  "type": "PEER",
  "stage": "TRANSLOG",
  "primary": false,
  "start_time": "2020-11-26T07:35:27.818Z",
  "start_time_in_millis": 1606376127818,
  "total_time": "17.6m",
  "total_time_in_millis": 1060082,
  "source": { - 
    "id": "l7cKKJOpTTy11odnaBJy-g",
    "host": "10.101.0.112",
    "transport_address": "10.101.0.112:9300",
    "ip": "10.101.0.112",
    "name": "nodename1"
  },
  "target": { - 
    "id": "PAHpy641TPqLYhPOKPnAvQ",
    "host": "10.101.0.111",
    "transport_address": "10.101.0.111:9300",
    "ip": "10.101.0.111",
    "name": "nodename2"
  },
  "index": { - 
    "size": { - 
      "total": "0b",
      "total_in_bytes": 0,
      "reused": "0b",
      "reused_in_bytes": 0,
      "recovered": "0b",
      "recovered_in_bytes": 0,
      "percent": "0.0%"
    },
    "files": { - 
      "total": 0,
      "reused": 0,
      "recovered": 0,
      "percent": "0.0%"
    },
    "total_time": "5ms",
    "total_time_in_millis": 5,
    "source_throttle_time": "-1",
    "source_throttle_time_in_millis": 0,
    "target_throttle_time": "-1",
    "target_throttle_time_in_millis": 0
  },
  "translog": { - 
    "recovered": 5967872,
    "total": -1,
    "percent": "-1.0%",
    "total_on_start": -1,
    "total_time": "17.6m",
    "total_time_in_millis": 1060074
  },
  "verify_index": { - 
    "check_index_time": "0s",
    "check_index_time_in_millis": 0,
    "total_time": "0s",
    "total_time_in_millis": 0
  }
}

I noticed that the "percent": "-1.0%", is very strange, then I checked this shard on server and got this:

-rw-r--r-- 1 elasticsearch elasticsearch   88 Nov 23 09:07 translog-92.ckp
-rw-r--r-- 1 elasticsearch elasticsearch  11G Nov 23 09:07 translog-92.tlog
-rw-r--r-- 1 elasticsearch elasticsearch   88 Nov 26 07:35 translog-93.ckp
-rw-r--r-- 1 elasticsearch elasticsearch 116G Nov 25 14:26 translog-93.tlog
-rw-r--r-- 1 elasticsearch elasticsearch   55 Nov 26 07:35 translog-94.tlog
-rw-r--r-- 1 elasticsearch elasticsearch   88 Nov 26 07:35 translog.ckp

116G of the translog with limit 256mb on index. Here is index settings btw:

{ - 
  "indexname": { - 
    "settings": { - 
      "index": { - 
        "refresh_interval": "3000s",
        "number_of_shards": "60",
        "translog": { - 
          "flush_threshold_size": "256mb"
        },
        "provided_name": "indexname",
        "creation_date": "1602579459530",
        "unassigned": { - 
          "node_left": { - 
            "delayed_timeout": "10m"
          }
        },
        "analysis": { - 
          "analyzer": { - 
            "lowercase_analyzer": { - 
              "filter": [ - 
                "lowercase"
              ],
              "type": "custom",
              "tokenizer": "keyword"
            },
            "reverse_analyzer": { - 
              "filter": [ - 
                "reverse"
              ],
              "type": "custom",
              "tokenizer": "keyword"
            },
            "reverse_lowercase_analyzer": { - 
              "filter": [ - 
                "reverse",
                "lowercase"
              ],
              "type": "custom",
              "tokenizer": "keyword"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "Enc5JIwLToqJUzq3HnSjoA",
        "version": { - 
          "created": "7060299"
        }
      }
    }
  }
}

I can't fix this shard with relocating or another way so my cluster changed to Yellow state forever while there is broken index in it. We use long-term and large indexes so this translog issue is very critical.
How I can occure or fix this strange behaviour? Or maybe on later versions this bug was fixed? Or am I doing something wrong?
Thank you in advance.

DavidTurner · November 26, 2020, 2:27pm

Can you share the index stats, GET /indexname/_stats?level=shards please?

rambling · November 28, 2020, 6:59am

Sure, here it is:

DavidTurner · November 28, 2020, 9:12am

Thanks. I suspect that whatever was blocking the synced flush caused the excessive translog size. It stopped being necessary to use synced flush in newer versions (see e.g. https://github.com/elastic/elasticsearch/pull/49126, https://github.com/elastic/elasticsearch/pull/50835) and your restart will have lost any context about why the synced flush was not completing.

I think it will help to manually force-flush this index with POST /certificate_v3/_flush?force. If successful, that will reduce the translog to zero on both shard copies which will avoid the slow recovery.

rambling · November 29, 2020, 7:34am

Next time I'll try to get all needed information before start to fix this kind of issues.
On your opinion - what can be the reason of slow flush? In my cluster I use nvme disks and this situation involves only 1 index and only 1 shard.
/_flush?force stucks for ~30 mins. Is it ok?
If I change index.translog.durability from request to async will it help to occure these situations in future?

DavidTurner · November 29, 2020, 7:51am

No, that sounds like the same problem. Is it still stuck? If so can you collect these outputs:

GET _tasks?detailed
GET _nodes/hot_threads?threads=99999
GET /indexname/_stats?level=shards

rambling · November 29, 2020, 8:07am

Still stuck, yes
Here is outputs:

DavidTurner · November 29, 2020, 8:36am

Thanks. This looks like a Lucene bug, possibly exacerbated by a bug in how you are generating doc IDs. The fix for the bug is included in 7.7.0 so you can upgrade to resolve it, but you should also check your client for problems.

   100.3% (501.7ms out of 500ms) cpu usage by thread 'elasticsearch[prod-es-data-10][refresh][T#124]'
     10/10 snapshots sharing following 33 elements
       app//org.apache.lucene.index.FrozenBufferedUpdates.applyDocValuesUpdates(FrozenBufferedUpdates.java:568)
       app//org.apache.lucene.index.FrozenBufferedUpdates.applyDocValuesUpdates(FrozenBufferedUpdates.java:451)
       app//org.apache.lucene.index.FrozenBufferedUpdates.apply(FrozenBufferedUpdates.java:421)
       app//org.apache.lucene.index.FrozenBufferedUpdates.forceApply(FrozenBufferedUpdates.java:249)
       app//org.apache.lucene.index.FrozenBufferedUpdates.tryApply(FrozenBufferedUpdates.java:159)
       app//org.apache.lucene.index.IndexWriter.lambda$publishFrozenUpdates$3(IndexWriter.java:2592)
       app//org.apache.lucene.index.IndexWriter$$Lambda$3760/0x00000008016ed040.process(Unknown Source)
       app//org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5116)
       app//org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:507)
       app//org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:297)
       app//org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272)
       app//org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262)
       app//org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
       app//org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:165)
       app//org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:66)
       app//org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:40)
       app//org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       app//org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       app//org.elasticsearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:344)
       app//org.elasticsearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:326)
       app//org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       app//org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:225)
       app//org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1611)

The usual instructions say to flush before upgrading but this won't work for obvious reasons: you'll need to skip this step. Quite possibly it won't be possible to take a snapshot before upgrading either for this reason. If you're doing a rolling upgrade then you should upgrade prod-es-data-10 first.

rambling · November 30, 2020, 12:42pm

Seems that I fixed this shard in some magic way - just change count of replicas to 0 and then back to 1. All translog on this shard is gone.
Also I stopped indexing in cluster and now preparing to rolling update to 7.10. Hope all will be smoothly.
Thanks for help.

DavidTurner · November 30, 2020, 1:08pm

That would do it, at the cost of temporarily reducing the redundancy for all of the shards in this index (risky) and needing to copy them all over the network again.

system · December 28, 2020, 1:09pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Translog don't decrease quickly enough Elasticsearch	10	702	January 15, 2020
Elasticsearch 2.1.1 is recovery can skip translog stage Elasticsearch	3	1067	July 5, 2017
ES 2.1 shards stuck in translog recovery Elasticsearch	14	6052	July 5, 2017
ES 2.0.1 - Getting huge Translogs size which is not getting cleared up in one of default five shards. Which finally causes Out Of Memory -All Shard Failure Elasticsearch	2	551	August 15, 2017
ES 2.1.1 translog grows forever Elasticsearch	2	863	July 5, 2017

Elasticesearch 7.6.2 translog overflow issue

Related topics