Translog doesn't decrease quickly enough

Dou · December 13, 2019, 2:53pm

Hi,

I'm currently upgrading from Elasticsearch 5.x to 7.x.
I'm now on 6.8.5.
I have created some new indices with reindex, extracting documents from the previous indices to split them by _type (to keep only one type per index).

Each new index takes a lot of space in its translog directory (.tlog files).

I'm short on free space and want to commit the translog:

[2019-12-13T15:45:01,111][INFO ][o.e.c.r.a.DiskThresholdMonitor] [Dpy6L8w] low disk watermark [85%] exceeded on [Dpy6L8wHR_WF4cUoxFeRVQ][Dpy6L8w][/home/log_cri/analyze/elasticsearch-data/nodes/0] free: 7.6gb[13.4%], replicas will not be assigned to this node

Example:

373M    Ct5OUN7CRZ6RW6TWko_Pfg/1/translog
373M    Ct5OUN7CRZ6RW6TWko_Pfg/2/translog
373M    Ct5OUN7CRZ6RW6TWko_Pfg/4/translog
374M    Ct5OUN7CRZ6RW6TWko_Pfg/0/translog
374M    Ct5OUN7CRZ6RW6TWko_Pfg/3/translog

I have tried to flush, but the translog remains large:

curl -X POST "localhost:9200/metrics-timer-2019.03/_flush?pretty"

I have also tried a forcemerge, with and without only_expunge_deletes:

curl -X POST "localhost:9200/_forcemerge?pretty&max_num_segments=1&only_expunge_deletes=true"

but the translog stays the same size.

I have tried to downsize the translog like this:

[2019-12-13T15:25:25,703][INFO ][o.e.c.s.IndexScopedSettings] [Dpy6L8w] updating [index.translog.flush_threshold_size] from [512mb] to [64mb]

but again each translog stays at 374M.

I have set number_of_replicas to 0.

See _cat/indices and du below (reported size vs. real space used):

curl -s "localhost:9200/_cat/indices?v" | grep Ct5OUN7CRZ6RW6TWko_Pfg
green open metrics-timer-2019.03 Ct5OUN7CRZ6RW6TWko_Pfg 5 0 2732578 0 724.9mb 724.9mb

elasticsearch-data/nodes/0/indices# du -shc Ct5OUN7CRZ6RW6TWko_Pfg
2,6G Ct5OUN7CRZ6RW6TWko_Pfg
2,6G total
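
A rough cross-check of the two figures, as a sketch only. It assumes the store size shown by _cat/indices does not count the translog directories, which is an assumption here rather than something stated in this thread. The translog sizes are the five du values listed earlier, and the function name is made up for illustration.

```shell
# Sum the five per-shard translog sizes (MB, from the du listing above)
# and add the store size reported by _cat/indices (724.9mb).
# Assumption: the _cat/indices store size excludes the translog directories.
disk_estimate() {
  awk 'BEGIN {
    translog = 373 + 373 + 373 + 374 + 374   # MB held in translog directories
    store    = 724.9                          # MB reported by _cat/indices
    printf "translog=%dMB total=%.1fMB\n", translog, translog + store
  }'
}

disk_estimate
```

Under that assumption the total comes to roughly 2.5G, close to the 2,6G that du reports, with about 70% of it held in translog files.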

I have also tried stopping and starting Elasticsearch.

I think I had the same problem yesterday and it resolved itself overnight.

But today I hit the same problem, and I can't wait a night each time.

I was expecting the translog to be committed after 5s, since the default value of index.translog.sync_interval is 5s.

What have I missed?

See the index parameters below:

curl -s "localhost:9200/metrics-timer-2019.03/" | json_pp

{
   "metrics-timer-2019.03" : {
      "mappings" : {
         "doc" : {
            "properties" : {
               "p99" : {
                  "type" : "float"
               },
               "m5" : {
                  "type" : "float"
               },
               "type" : {
                  "index" : false,
                  "type" : "keyword"
               },
               "@timestamp" : {
                  "type" : "date",
                  "format" : "dateOptionalTime"
               },
               "stddev" : {
                  "type" : "float"
               },
               "threadname" : {
                  "fields" : {
                     "keyword" : {
                        "ignore_above" : 256,
                        "type" : "keyword"
                     }
                  },
                  "type" : "text"
               },
               "median" : {
                  "type" : "float"
               },
               "p999" : {
                  "type" : "float"
               },
               "max" : {
                  "type" : "float"
               },
               "class" : {
                  "type" : "text",
                  "fields" : {
                     "keyword" : {
                        "type" : "keyword",
                        "ignore_above" : 256
                     }
                  }
               },
               "origin" : {
                  "type" : "keyword"
               },
               "tags" : {
                  "type" : "keyword"
               },
               "m15" : {
                  "type" : "float"
               },
               "path" : {
                  "type" : "keyword"
               },
               "name" : {
                  "type" : "keyword"
               },
               "host" : {
                  "type" : "keyword"
               },
               "mean_rate" : {
                  "type" : "float"
               },
               "log_date" : {
                  "type" : "text",
                  "fields" : {
                     "keyword" : {
                        "ignore_above" : 256,
                        "type" : "keyword"
                     }
                  }
               },
               "min" : {
                  "type" : "float"
               },
               "rate_unit" : {
                  "type" : "keyword"
               },
               "count" : {
                  "type" : "long"
               },
               "p95" : {
                  "type" : "float"
               },
               "message" : {
                  "type" : "text",
                  "fields" : {
                     "keyword" : {
                        "type" : "keyword",
                        "ignore_above" : 256
                     }
                  }
               },
               "duration_unit" : {
                  "type" : "keyword"
               },
               "mean" : {
                  "type" : "float"
               },
               "m1" : {
                  "type" : "float"
               },
               "loglevel" : {
                  "type" : "text",
                  "fields" : {
                     "keyword" : {
                        "ignore_above" : 256,
                        "type" : "keyword"
                     }
                  }
               },
               "p75" : {
                  "type" : "float"
               },
               "process_time" : {
                  "type" : "date",
                  "format" : "dateOptionalTime"
               },
               "p98" : {
                  "type" : "float"
               },
               "stack" : {
                  "type" : "text",
                  "fields" : {
                     "keyword" : {
                        "ignore_above" : 256,
                        "type" : "keyword"
                     }
                  }
               },
               "@version" : {
                  "type" : "keyword"
               }
            }
         }
      },
      "aliases" : {},
      "settings" : {
         "index" : {
            "number_of_shards" : "5",
            "creation_date" : "1576233729703",
            "version" : {
               "created" : "6080599"
            },
            "number_of_replicas" : "0",
            "provided_name" : "metrics-timer-2019.03",
            "uuid" : "Ct5OUN7CRZ6RW6TWko_Pfg",
            "translog" : {
               "flush_threshold_size" : "64mb"
            }
         }
      }
   }
}

Dou · December 13, 2019, 3:58pm

I also tried decreasing the retention age, but nothing was freed:

[2019-12-13T16:52:53,063][INFO ][o.e.c.s.IndexScopedSettings] [Dpy6L8w] updating [index.translog.retention.age] from [12h] to [30s]

Dou · December 13, 2019, 4:06pm

Then setting
"translog" : { "retention" : { "size" : "64mb" } }

plus a restart of Elasticsearch finally shrank the translog:

44K     Ct5OUN7CRZ6RW6TWko_Pfg/0/translog
44K     Ct5OUN7CRZ6RW6TWko_Pfg/1/translog
44K     Ct5OUN7CRZ6RW6TWko_Pfg/2/translog
44K     Ct5OUN7CRZ6RW6TWko_Pfg/3/translog
44K     Ct5OUN7CRZ6RW6TWko_Pfg/4/translog
220K    total

DavidTurner · December 13, 2019, 5:09pm

It should be sufficient to set the retention age to a short time, wait for that time to elapse, then run POST _flush (or maybe POST _flush?force). A restart shouldn't be necessary.
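
The sequence described here could be scripted roughly as follows. This is a minimal sketch, not a tested procedure: the host localhost:9200 and the index name are taken from earlier in the thread, while the helper names, the 30-second default, and the extra 5-second margin are assumptions made for illustration.

```shell
# Assumed host and port; override with ES=host:port if yours differs.
ES="${ES:-localhost:9200}"

# Build the settings body that shortens translog retention.
retention_body() {
  printf '{"index.translog.retention.age":"%s"}' "$1"
}

# Build the forced-flush URL for one index.
force_flush_url() {
  printf '%s/%s/_flush?force=true' "$ES" "$1"
}

# Shorten retention, wait for that time to elapse, then force a flush.
shrink_translog() {
  index="$1"
  age_seconds="${2:-30}"   # illustrative default, not a recommendation
  curl -s -X PUT "$ES/$index/_settings" \
       -H 'Content-Type: application/json' \
       -d "$(retention_body "${age_seconds}s")"
  sleep "$((age_seconds + 5))"   # small margin on top of the retention age
  curl -s -X POST "$(force_flush_url "$index")"
}
```

For the index in this thread the call would be `shrink_translog metrics-timer-2019.03 30`. Nothing is sent to the cluster until the function is called.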

This is greatly improved by #45473 in 7.4.0: thanks to soft deletes, we no longer need to retain all this translog for peer recoveries.

Dou · December 16, 2019, 10:02am

Thanks David,

I have just tried (still on 6.8.5) to see whether a short retention age plus a flush (with ?force) is sufficient, and yes, it is, but the force flag is required.

It seems force is not the default in 6.8 and is true by default in 7.x.

Can you explain why the retention age is not enforced without explicitly calling a (forced) flush?
Or maybe it is not called often enough?

I'm continuing the upgrade path to 7.5 and will see whether it goes more smoothly.

I don't expect anything from soft deletes, since my case is creating new indices with reindex. The new indices have 0 deletes, so I don't expect the new soft-delete feature to help.

Thanks

DavidTurner · December 16, 2019, 10:22am

Calling POST _flush doesn't normally do anything at all if you haven't indexed any documents, even if you have changed the retention policy. Overriding this check is what the ?force flag is for.
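
One way to observe this difference is to read the translog section of the index stats before and after each kind of flush. The sketch below is an illustration under assumptions: the host localhost:9200 comes from earlier in the thread, the helper names are hypothetical, and the exact fields in the stats response may vary by version.

```shell
# Assumed host and port; override with ES=host:port if yours differs.
ES="${ES:-localhost:9200}"

# URL for the translog section of the index stats API.
translog_stats_url() {
  printf '%s/%s/_stats/translog?pretty' "$ES" "$1"
}

# Show translog stats, run a plain flush, show them again,
# then run a forced flush and show them a third time.
compare_flush() {
  index="$1"
  curl -s "$(translog_stats_url "$index")"
  curl -s -X POST "$ES/$index/_flush"
  curl -s "$(translog_stats_url "$index")"
  curl -s -X POST "$ES/$index/_flush?force=true"
  curl -s "$(translog_stats_url "$index")"
}
```

Calling `compare_flush metrics-timer-2019.03` on an idle index should make it visible whether the translog size changes only after the forced flush.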

You will see smaller translogs in 7.4 whether you have deletions or not.

Dou · December 18, 2019, 11:58am

Now on 7.5, with these settings:

   "settings" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0,
      "index" : {
         "translog" : {
            "flush_threshold_size" : "64mb",
            "retention" : {
               "size" : "64mb",
               "age" : "30s"
            }
         }
      }
   }

It took 5 minutes after the reindex finished for the translog to shrink from 36 MB (max 79 MB during the reindex) to 12 KB, without any call to _flush.

DavidTurner · December 18, 2019, 12:04pm

That is within my expectations - Elasticsearch performs a flush automatically on an index if it hasn't seen any indexing activity within the last 5 minutes.

Dou · December 18, 2019, 12:13pm

Thanks, so all is fine now in 7.5.

Just to be complete:
is this 5-minute timeout configurable? Where?

Many thanks for your support

DavidTurner · December 18, 2019, 12:18pm

If you want the translog to be cleaned up promptly at the end of a reindex then I think it's a better idea to flush it manually. Also note that the translog is a per-shard thing so if you have fewer shards in each index then the reindex will need less space.
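
A manual flush at the end of a reindex might look like the sketch below. It is an illustration only: the host localhost:9200 comes from earlier in the thread, the helper names are hypothetical, and so is the source index name in the usage note, since the thread never names the source indices.

```shell
# Assumed host and port; override with ES=host:port if yours differs.
ES="${ES:-localhost:9200}"

# Build a minimal _reindex request body from a source and destination index.
reindex_body() {
  printf '{"source":{"index":"%s"},"dest":{"index":"%s"}}' "$1" "$2"
}

# Reindex, then flush the destination index as soon as the reindex returns.
reindex_then_flush() {
  src="$1"
  dest="$2"
  # _reindex waits for completion by default, so the flush runs afterwards.
  curl -s -X POST "$ES/_reindex" \
       -H 'Content-Type: application/json' \
       -d "$(reindex_body "$src" "$dest")"
  curl -s -X POST "$ES/$dest/_flush"
}
```

A call such as `reindex_then_flush old-metrics-2019.03 metrics-timer-2019.03` would then avoid waiting for the automatic flush. Creating the destination index with fewer shards beforehand reduces the number of translogs involved.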

system · January 15, 2020, 12:18pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
