Translog doesn't shrink quickly enough

Hi,

I'm currently upgrading from Elasticsearch 5.x to 7.x and am now on 6.8.5.
I have created some new indices with reindex, extracting documents from the previous ones to split them by _type (to keep only one type per index).

Each new index takes a lot of space in its translog directory (.tlog files).

I'm short on free disk space and want to commit the translog:

[2019-12-13T15:45:01,111][INFO ][o.e.c.r.a.DiskThresholdMonitor] [Dpy6L8w] low disk watermark [85%] exceeded on [Dpy6L8wHR_WF4cUoxFeRVQ][Dpy6L8w][/home/log_cri/analyze/elasticsearch-data/nodes/0] free: 7.6gb[13.4%], replicas will not be assigned to this node

Example:

373M    Ct5OUN7CRZ6RW6TWko_Pfg/1/translog
373M    Ct5OUN7CRZ6RW6TWko_Pfg/2/translog
373M    Ct5OUN7CRZ6RW6TWko_Pfg/4/translog
374M    Ct5OUN7CRZ6RW6TWko_Pfg/0/translog
374M    Ct5OUN7CRZ6RW6TWko_Pfg/3/translog

I have tried to flush, but the translog remains large:

curl -X POST "localhost:9200/metrics-timer-2019.03/_flush?pretty"

I have also tried a force merge, with and without only_expunge_deletes:

curl -X POST "localhost:9200/_forcemerge?pretty&max_num_segments=1&only_expunge_deletes=true"

but the translog stays the same size.

I have tried to lower the translog flush threshold like this:

[2019-12-13T15:25:25,703][INFO ][o.e.c.s.IndexScopedSettings] [Dpy6L8w] updating [index.translog.flush_threshold_size] from [512mb] to [64mb]

but again each translog stays at 374M.

I have set number_of_replicas to 0.

See _cat/indices and du below for the reported size versus the real space used:

curl -s "localhost:9200/_cat/indices?v" | grep Ct5OUN7CRZ6RW6TWko_Pfg
green open metrics-timer-2019.03 Ct5OUN7CRZ6RW6TWko_Pfg 5 0 2732578 0 724.9mb 724.9mb

elasticsearch-data/nodes/0/indices# du -shc Ct5OUN7CRZ6RW6TWko_Pfg
2,6G Ct5OUN7CRZ6RW6TWko_Pfg
2,6G total
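
As a rough arithmetic check of the numbers above, here is a minimal sketch. It assumes that the size reported by _cat/indices does not count the translog files, which is an assumption and not something the output itself states. The per-shard values are copied from the du listing earlier in this post.

```python
# Per-shard translog sizes (MiB), copied from the du listing earlier in this post.
translog_mib = {0: 374, 1: 373, 2: 373, 3: 374, 4: 373}

# Size reported by _cat/indices for the same index (MiB).
# Assumption: this figure does not include the translog files.
store_mib = 724.9

# Sum the translog over all five shards, then add the reported store size.
translog_total_mib = sum(translog_mib.values())
on_disk_mib = store_mib + translog_total_mib

print(f"translog total: {translog_total_mib} MiB")
print(f"store + translog: {on_disk_mib:.1f} MiB ({on_disk_mib / 1024:.2f} GiB)")
```

Under that assumption the five translogs account for most of the gap: store plus translog comes to roughly 2.5 GiB, close to the 2,6G that du reports.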

I have tried stopping and restarting Elasticsearch too.

I think I hit the same problem yesterday and it resolved by itself during the night.

But today I hit the same problem again, and I can't wait a night each time :)

I expected the translog to be committed after 5s, since the default value of index.translog.sync_interval is 5s.

What have I missed?

See the index mappings and settings below:

curl -s "localhost:9200/metrics-timer-2019.03/" | json_pp

{
   "metrics-timer-2019.03" : {
      "mappings" : {
         "doc" : {
            "properties" : {
               "p99" : {
                  "type" : "float"
               },
               "m5" : {
                  "type" : "float"
               },
               "type" : {
                  "index" : false,
                  "type" : "keyword"
               },
               "@timestamp" : {
                  "type" : "date",
                  "format" : "dateOptionalTime"
               },
               "stddev" : {
                  "type" : "float"
               },
               "threadname" : {
                  "fields" : {
                     "keyword" : {
                        "ignore_above" : 256,
                        "type" : "keyword"
                     }
                  },
                  "type" : "text"
               },
               "median" : {
                  "type" : "float"
               },
               "p999" : {
                  "type" : "float"
               },
               "max" : {
                  "type" : "float"
               },
               "class" : {
                  "type" : "text",
                  "fields" : {
                     "keyword" : {
                        "type" : "keyword",
                        "ignore_above" : 256
                     }
                  }
               },
               "origin" : {
                  "type" : "keyword"
               },
               "tags" : {
                  "type" : "keyword"
               },
               "m15" : {
                  "type" : "float"
               },
               "path" : {
                  "type" : "keyword"
               },
               "name" : {
                  "type" : "keyword"
               },
               "host" : {
                  "type" : "keyword"
               },
               "mean_rate" : {
                  "type" : "float"
               },
               "log_date" : {
                  "type" : "text",
                  "fields" : {
                     "keyword" : {
                        "ignore_above" : 256,
                        "type" : "keyword"
                     }
                  }
               },
               "min" : {
                  "type" : "float"
               },
               "rate_unit" : {
                  "type" : "keyword"
               },
               "count" : {
                  "type" : "long"
               },
               "p95" : {
                  "type" : "float"
               },
               "message" : {
                  "type" : "text",
                  "fields" : {
                     "keyword" : {
                        "type" : "keyword",
                        "ignore_above" : 256
                     }
                  }
               },
               "duration_unit" : {
                  "type" : "keyword"
               },
               "mean" : {
                  "type" : "float"
               },
               "m1" : {
                  "type" : "float"
               },
               "loglevel" : {
                  "type" : "text",
                  "fields" : {
                     "keyword" : {
                        "ignore_above" : 256,
                        "type" : "keyword"
                     }
                  }
               },
               "p75" : {
                  "type" : "float"
               },
               "process_time" : {
                  "type" : "date",
                  "format" : "dateOptionalTime"
               },
               "p98" : {
                  "type" : "float"
               },
               "stack" : {
                  "type" : "text",
                  "fields" : {
                     "keyword" : {
                        "ignore_above" : 256,
                        "type" : "keyword"
                     }
                  }
               },
               "@version" : {
                  "type" : "keyword"
               }
            }
         }
      },
      "aliases" : {},
      "settings" : {
         "index" : {
            "number_of_shards" : "5",
            "creation_date" : "1576233729703",
            "version" : {
               "created" : "6080599"
            },
            "number_of_replicas" : "0",
            "provided_name" : "metrics-timer-2019.03",
            "uuid" : "Ct5OUN7CRZ6RW6TWko_Pfg",
            "translog" : {
               "flush_threshold_size" : "64mb"
            }
         }
      }
   }
}

I also tried decreasing the retention age, but nothing was freed:

[2019-12-13T16:52:53,063][INFO ][o.e.c.s.IndexScopedSettings] [Dpy6L8w] updating [index.translog.retention.age] from [12h] to [30s]

Then setting
"translog" : { "retention" : { "size" : "64mb" } }

plus a restart of Elasticsearch finally shrank the transaction log:

44K     Ct5OUN7CRZ6RW6TWko_Pfg/0/translog
44K     Ct5OUN7CRZ6RW6TWko_Pfg/1/translog
44K     Ct5OUN7CRZ6RW6TWko_Pfg/2/translog
44K     Ct5OUN7CRZ6RW6TWko_Pfg/3/translog
44K     Ct5OUN7CRZ6RW6TWko_Pfg/4/translog
220K    total

It should be sufficient to set the retention age to a short time, wait for that time to elapse, then run POST _flush (or maybe POST _flush?force). A restart shouldn't be necessary.
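
The sequence described above (shorten the retention age, wait for it to elapse, then flush with force) could be sketched as follows. This is only an illustration under stated assumptions: the helper names (`build_request`, `retention_request`, `force_flush_request`, `trim_translog`) are hypothetical, the node address is the localhost:9200 used earlier in the thread, and the extra few seconds of waiting are an arbitrary safety margin.

```python
import json
import time
import urllib.request

ES = "http://localhost:9200"  # assumption: the local node used elsewhere in this thread


def build_request(method, path, body=None):
    """Build (but do not send) a request for the Elasticsearch REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(ES + path, data=data, headers=headers, method=method)


def retention_request(index, age="30s"):
    """PUT <index>/_settings to shorten index.translog.retention.age."""
    body = {"index": {"translog": {"retention": {"age": age}}}}
    return build_request("PUT", f"/{index}/_settings", body)


def force_flush_request(index):
    """POST <index>/_flush with the force flag set."""
    return build_request("POST", f"/{index}/_flush?force=true")


def trim_translog(index, age_seconds=30, margin_seconds=5):
    """Hypothetical helper: shorten retention, wait it out, then force a flush."""
    with urllib.request.urlopen(retention_request(index, f"{age_seconds}s")) as resp:
        resp.read()
    # Wait until the retention age has elapsed (plus an arbitrary margin).
    time.sleep(age_seconds + margin_seconds)
    with urllib.request.urlopen(force_flush_request(index)) as resp:
        return json.loads(resp.read())
```

Calling `trim_translog("metrics-timer-2019.03")` would send the two requests against the index from this thread; nothing is sent until that function is called.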

This is greatly improved by #45473 in 7.4.0: thanks to soft deletes, we no longer need to retain all this translog for peer recoveries.

Thanks David,

I just tried (still on 6.8.5) to see whether a short retention age plus a flush (with ?force) is sufficient, and yes, the force flag is required.

It seems force was not the default in 6.8 and is true by default in 7.x.

Can you explain why the retention age is not enforced without explicitly calling a (forced) flush?
Or maybe it is not checked often enough?

I will continue the upgrade path to 7.5 and see whether it goes more smoothly.

I don't expect anything from soft deletes, since my case is creating new indices with reindex. The new indices have 0 deletes, so I don't expect the new soft-deletes feature to help.

Thanks

Calling POST _flush doesn't normally do anything at all if you haven't indexed any documents, even if you have changed the retention policy. Overriding this check is what the ?force flag is for.

You will see smaller translogs in 7.4 whether you have deletions or not.

Now on 7.5,
with these settings:

   "settings" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0,
      "index" : {
         "translog" : {
            "flush_threshold_size" : "64mb",
            "retention" : {
               "size" : "64mb",
               "age" : "30s"
            }
         }
      }
   }

it took 5 minutes after the reindex finished for the translog to shrink from 36MB (max 79MB during the reindex) to 12KB, without any call to _flush.

That is within my expectations - Elasticsearch performs a flush automatically on an index if it hasn't seen any indexing activity within the last 5 minutes.

Thanks, so all is fine now in 7.5 :)

Just to be complete:
is this 5-minute timeout configurable? Where?

Many thanks for your support.

If you want the translog to be cleaned up promptly at the end of a reindex then I think it's a better idea to flush it manually. Also note that the translog is a per-shard thing so if you have fewer shards in each index then the reindex will need less space.
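
A minimal sketch of that advice, assuming a single-shard destination index and a manual flush right after the reindex completes. The function names and the way the requests are chained are hypothetical, and the reindex body is reduced to a bare source and destination (the per-type filtering used in this thread is left out).

```python
import json
import urllib.request

ES = "http://localhost:9200"  # assumption: the local node used elsewhere in this thread


def build_request(method, path, body=None):
    """Build (but do not send) a request for the Elasticsearch REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(ES + path, data=data, headers=headers, method=method)


def create_index_request(dest_index, shards=1):
    """PUT <dest_index> with few shards, since each shard has its own translog."""
    body = {"settings": {"number_of_shards": shards, "number_of_replicas": 0}}
    return build_request("PUT", f"/{dest_index}", body)


def reindex_request(source_index, dest_index):
    """POST _reindex and wait for it to finish before returning."""
    body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    return build_request("POST", "/_reindex?wait_for_completion=true", body)


def flush_request(dest_index):
    """POST <dest_index>/_flush to commit and clean up the translog promptly."""
    return build_request("POST", f"/{dest_index}/_flush")


def reindex_then_flush(source_index, dest_index):
    """Hypothetical helper: create the destination, reindex into it, then flush."""
    steps = (
        create_index_request(dest_index),
        reindex_request(source_index, dest_index),
        flush_request(dest_index),
    )
    for req in steps:
        with urllib.request.urlopen(req) as resp:
            resp.read()
```

The flush is issued only after the reindex call has returned, so the translog should not have to wait for the automatic flush on an idle index.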
