How to remove deleted Elasticsearch logs from the file system?

Hi everyone!

I have a problem: I run Elasticsearch 6.2 on a CentOS 7 machine, and I can see that my disk is filling up (by running the "df" command).

When I check with the "du" command I don't see anything that could account for the disk usage.

When I run:

lsof | grep '(deleted)'

I see a lot of gzipped Elasticsearch logs that Elasticsearch has deleted, but since Elasticsearch is still running they are not actually removed from the file system, and they fill up the disk.
When I stop and restart Elasticsearch the files are removed.
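
For completeness, a slightly more targeted way to list these deleted-but-still-open files (a sketch; ES_PID is a placeholder for the Elasticsearch process id, and lsof's +L1 option restricts the output to files whose link count is zero, i.e. deleted):

# Show only open files that have already been unlinked from the filesystem.
sudo lsof -nP +L1 | grep java

# ES_PID is a placeholder for the Elasticsearch process id.
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)

# /proc/<pid>/fd symlinks to deleted files end in " (deleted)".
sudo ls -l /proc/$ES_PID/fd | grep '(deleted)'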

My question is:
Considering I need Elasticsearch to be up 24/7 in production, how can I make sure that Elasticsearch actually removes these files from the file system?

I don't think this should happen with the default logging configuration, although it is possible with some configurations. Can you share your log4j2.properties?

I use the default settings:

status = error
logger.action.name = org.elasticsearch.action
logger.action.level = debug

appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = [%d{ISO8601}][%-5p][%-25c{1.}] %marker%m%n

appender.rolling.type = RollingFile
appender.rolling.name = rolling
appender.rolling.fileName = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}.log
appender.rolling.layout.type = PatternLayout
appender.rolling.layout.pattern = [%d{ISO8601}][%-5p][%-25c{1.}] %marker%.-10000m%n
appender.rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}-%d{yyyy-MM-dd}-%i.log.gz
appender.rolling.policies.type = Policies
appender.rolling.policies.time.type = TimeBasedTriggeringPolicy
appender.rolling.policies.time.interval = 1
appender.rolling.policies.time.modulate = true
appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.rolling.policies.size.size = 128MB
appender.rolling.strategy.type = DefaultRolloverStrategy
appender.rolling.strategy.fileIndex = nomax
appender.rolling.strategy.action.type = Delete
appender.rolling.strategy.action.basepath = ${sys:es.logs.base_path}
appender.rolling.strategy.action.condition.type = IfFileName
appender.rolling.strategy.action.condition.glob = ${sys:es.logs.cluster_name}-*
appender.rolling.strategy.action.condition.nested_condition.type = IfAccumulatedFileSize
appender.rolling.strategy.action.condition.nested_condition.exceeds = 2GB

rootLogger.level = info
rootLogger.appenderRef.console.ref = console
rootLogger.appenderRef.rolling.ref = rolling

appender.deprecation_rolling.type = RollingFile
appender.deprecation_rolling.name = deprecation_rolling
appender.deprecation_rolling.fileName = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}_deprecation.log
appender.deprecation_rolling.layout.type = PatternLayout
appender.deprecation_rolling.layout.pattern = [%d{ISO8601}][%-5p][%-25c{1.}] %marker%.-10000m%n
appender.deprecation_rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}_deprecation-%i.log.gz
appender.deprecation_rolling.policies.type = Policies
appender.deprecation_rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.deprecation_rolling.policies.size.size = 1GB
appender.deprecation_rolling.strategy.type = DefaultRolloverStrategy
appender.deprecation_rolling.strategy.max = 4

logger.deprecation.name = org.elasticsearch.deprecation
logger.deprecation.level = warn
logger.deprecation.appenderRef.deprecation_rolling.ref = deprecation_rolling
logger.deprecation.additivity = false

appender.index_search_slowlog_rolling.type = RollingFile
appender.index_search_slowlog_rolling.name = index_search_slowlog_rolling
appender.index_search_slowlog_rolling.fileName = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}_index_search_slowlog.log
appender.index_search_slowlog_rolling.layout.type = PatternLayout
appender.index_search_slowlog_rolling.layout.pattern = [%d{ISO8601}][%-5p][%-25c] %marker%.-10000m%n
appender.index_search_slowlog_rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}_index_search_slowlog-%d{yyyy-MM-dd}.log
appender.index_search_slowlog_rolling.policies.type = Policies
appender.index_search_slowlog_rolling.policies.time.type = TimeBasedTriggeringPolicy
appender.index_search_slowlog_rolling.policies.time.interval = 1
appender.index_search_slowlog_rolling.policies.time.modulate = true

logger.index_search_slowlog_rolling.name = index.search.slowlog
logger.index_search_slowlog_rolling.level = trace
logger.index_search_slowlog_rolling.appenderRef.index_search_slowlog_rolling.ref = index_search_slowlog_rolling
logger.index_search_slowlog_rolling.additivity = false

appender.index_indexing_slowlog_rolling.type = RollingFile
appender.index_indexing_slowlog_rolling.name = index_indexing_slowlog_rolling
appender.index_indexing_slowlog_rolling.fileName = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}_index_indexing_slowlog.log
appender.index_indexing_slowlog_rolling.layout.type = PatternLayout
appender.index_indexing_slowlog_rolling.layout.pattern = [%d{ISO8601}][%-5p][%-25c] %marker%.-10000m%n
appender.index_indexing_slowlog_rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}_index_indexing_slowlog-%d{yyyy-MM-dd}.log
appender.index_indexing_slowlog_rolling.policies.type = Policies
appender.index_indexing_slowlog_rolling.policies.time.type = TimeBasedTriggeringPolicy
appender.index_indexing_slowlog_rolling.policies.time.interval = 1
appender.index_indexing_slowlog_rolling.policies.time.modulate = true

logger.index_indexing_slowlog.name = index.indexing.slowlog.index
logger.index_indexing_slowlog.level = trace
logger.index_indexing_slowlog.appenderRef.index_indexing_slowlog_rolling.ref = index_indexing_slowlog_rolling
logger.index_indexing_slowlog.additivity = false

Ok, this seems strange.

Can you show us the output from that lsof command? I tried to reproduce what you describe by configuring an Elasticsearch node to roll over its logs every second and to delete them shortly afterwards, and I do not see anything leaking:

$ diff -U2 elasticsearch-6.7.0/config-0/log4j2.properties elasticsearch-6.7.0/config-default/log4j2.properties
--- elasticsearch-6.7.0/config-0/log4j2.properties	2019-03-27 11:06:02.984388285 +0100
+++ elasticsearch-6.7.0/config-default/log4j2.properties	2019-03-21 16:34:42.000000000 +0100
@@ -15,5 +15,5 @@
 appender.rolling.layout.type = PatternLayout
 appender.rolling.layout.pattern = [%d{ISO8601}][%-5p][%-25c{1.}] [%node_name]%marker %.-10000m%n
-appender.rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}-%d{yyyy-MM-dd-HH-m-s}-%i.log.gz
+appender.rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}-%d{yyyy-MM-dd}-%i.log.gz
 appender.rolling.policies.type = Policies
 appender.rolling.policies.time.type = TimeBasedTriggeringPolicy
@@ -29,5 +29,5 @@
 appender.rolling.strategy.action.condition.glob = ${sys:es.logs.cluster_name}-*
 appender.rolling.strategy.action.condition.nested_condition.type = IfAccumulatedFileSize
-appender.rolling.strategy.action.condition.nested_condition.exceeds = 2048B
+appender.rolling.strategy.action.condition.nested_condition.exceeds = 2GB

 rootLogger.level = info
$ ls elasticsearch-6.7.0/logs/ -al
total 1176
drwxr-xr-x  2 davidturner davidturner   4096 Mar 27 11:11 .
drwxr-xr-x 10 davidturner davidturner   4096 Mar 27 10:53 ..
-rw-rw-r--  1 davidturner davidturner    313 Mar 27 11:10 elasticsearch-2019-03-27-11-10-42-1.log.gz
-rw-rw-r--  1 davidturner davidturner    313 Mar 27 11:10 elasticsearch-2019-03-27-11-10-45-1.log.gz
-rw-rw-r--  1 davidturner davidturner    313 Mar 27 11:10 elasticsearch-2019-03-27-11-10-48-1.log.gz
-rw-rw-r--  1 davidturner davidturner    313 Mar 27 11:10 elasticsearch-2019-03-27-11-10-51-1.log.gz
-rw-rw-r--  1 davidturner davidturner    312 Mar 27 11:10 elasticsearch-2019-03-27-11-10-54-1.log.gz
-rw-rw-r--  1 davidturner davidturner    313 Mar 27 11:11 elasticsearch-2019-03-27-11-10-57-1.log.gz
-rw-rw-r--  1 davidturner davidturner      0 Mar 27 10:53 elasticsearch_access.log
-rw-rw-r--  1 davidturner davidturner      0 Mar 27 10:53 elasticsearch_audit.log
-rw-rw-r--  1 davidturner davidturner      0 Mar 27 10:53 elasticsearch_deprecation.log
-rw-rw-r--  1 davidturner davidturner      0 Mar 27 10:53 elasticsearch_index_indexing_slowlog.log
-rw-rw-r--  1 davidturner davidturner      0 Mar 27 10:53 elasticsearch_index_search_slowlog.log
-rw-rw-r--  1 davidturner davidturner    377 Mar 27 11:11 elasticsearch.log
-rw-rw-r--  1 davidturner davidturner 346546 Mar 27 11:11 gc.log
-rw-rw-r--  1 davidturner davidturner 281061 Mar 27 11:00 gc.log.00
-rw-rw-r--  1 davidturner davidturner 277315 Mar 27 11:04 gc.log.01
-rw-rw-r--  1 davidturner davidturner 257833 Mar 27 11:06 gc.log.02
$ sudo lsof | grep $ES_PID | grep 'deleted' | wc -l
85
$ # time passes
$ ls elasticsearch-6.7.0/logs/ -al
total 1184
drwxr-xr-x  2 davidturner davidturner   4096 Mar 27 11:11 .
drwxr-xr-x 10 davidturner davidturner   4096 Mar 27 10:53 ..
-rw-rw-r--  1 davidturner davidturner    312 Mar 27 11:10 elasticsearch-2019-03-27-11-10-54-1.log.gz
-rw-rw-r--  1 davidturner davidturner    313 Mar 27 11:11 elasticsearch-2019-03-27-11-10-57-1.log.gz
-rw-rw-r--  1 davidturner davidturner    312 Mar 27 11:11 elasticsearch-2019-03-27-11-11-0-1.log.gz
-rw-rw-r--  1 davidturner davidturner    312 Mar 27 11:11 elasticsearch-2019-03-27-11-11-3-1.log.gz
-rw-rw-r--  1 davidturner davidturner    311 Mar 27 11:11 elasticsearch-2019-03-27-11-11-6-1.log.gz
-rw-rw-r--  1 davidturner davidturner    312 Mar 27 11:11 elasticsearch-2019-03-27-11-11-9-1.log.gz
-rw-rw-r--  1 davidturner davidturner      0 Mar 27 10:53 elasticsearch_access.log
-rw-rw-r--  1 davidturner davidturner      0 Mar 27 10:53 elasticsearch_audit.log
-rw-rw-r--  1 davidturner davidturner      0 Mar 27 10:53 elasticsearch_deprecation.log
-rw-rw-r--  1 davidturner davidturner      0 Mar 27 10:53 elasticsearch_index_indexing_slowlog.log
-rw-rw-r--  1 davidturner davidturner      0 Mar 27 10:53 elasticsearch_index_search_slowlog.log
-rw-rw-r--  1 davidturner davidturner    377 Mar 27 11:11 elasticsearch.log
-rw-rw-r--  1 davidturner davidturner 352903 Mar 27 11:11 gc.log
-rw-rw-r--  1 davidturner davidturner 281061 Mar 27 11:00 gc.log.00
-rw-rw-r--  1 davidturner davidturner 277315 Mar 27 11:04 gc.log.01
-rw-rw-r--  1 davidturner davidturner 257833 Mar 27 11:06 gc.log.02
$ sudo lsof | grep $ES_PID | grep 'deleted' | wc -l
85

Here it is:
It's 130 lines and counting:

Also, the total number of files that Elasticsearch has open is very large.
When I run:
lsof | grep elasticsearch | wc -l
I get ~30,000

The output you've shown is really just two file descriptors (1w and 2w) repeated once for each thread in Elasticsearch.
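
To get the true count, it is more reliable to count the entries in /proc, which lists each descriptor exactly once (a sketch; ES_PID is a placeholder for the Elasticsearch process id):

# ES_PID is a placeholder for the Elasticsearch process id.
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)

# /proc/<pid>/fd contains one entry per open descriptor,
# so the count is not multiplied by the number of JVM threads.
sudo ls /proc/$ES_PID/fd | wc -l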

Are you using anything outside of Elasticsearch to manage your log rotation? I don't think Elasticsearch would ever itself open a file with the name this one has, so I think it might have been renamed out from underneath it.

I will check, but does that matter for the issue of too many open files that clog up my disk?
And isn't the setting I shared responsible for the log file name format?

appender.index_indexing_slowlog_rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}_index_indexing_slowlog-%d{yyyy-MM-dd}.log

No, the "Too many open files" error is about running out of file descriptors, but as I said there are only two here.

The line you shared is part of the rotation config for the slowlog, but I don't think that helps. What I think has happened is that something outside Elasticsearch renamed the log file and then deleted it while Elasticsearch was still writing to it. This sounds like something an external log rotation process would do (e.g. logrotate).
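
A quick way to check for that (a sketch, assuming a standard logrotate installation and its usual config locations):

# Look for any logrotate rule that touches the Elasticsearch log directory.
grep -rn elasticsearch /etc/logrotate.conf /etc/logrotate.d/ 2>/dev/null

If such a rule exists, it renames or removes files that the log4j2 rolling appender is already managing, which is exactly what leaves deleted-but-still-open log files behind.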

I couldn't find any other program that messes with the Elasticsearch log configuration.
The problem persists: a lot of deleted log files, and they fill my disk.
My problem is OOM and "too many open files" errors, and I can't remove the files without stopping Elasticsearch entirely.
Is there a better way to handle this issue?

Your original post was about excessive disk usage; this is the first time you've mentioned an OOM. Are you also seeing a "Too many open files" error? Can you share the full error message and stack trace? Can you use lsof to identify all the open files? So far we have only seen two open file descriptors.

Can you also share the output of GET _cluster/health?
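
For reference, that can be fetched with curl (host and port are assumptions for a local node):

curl -s 'localhost:9200/_cluster/health?pretty'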

Sorry, let me clarify:
I'm having memory trouble on my machine; programs are crashing due to memory shortage and "too many open files" errors.
The only thing I can think of that could be responsible is Elasticsearch, which holds a lot of open files at any given moment (~30-50K), while the disk is filling with deleted logs that don't get removed unless I stop Elasticsearch.
I want to use Elasticsearch in production, but how can I if I don't understand how to stop it from filling my disk and opening so many files?

I do not think that open files are a cause of OOMs, and 30-50k open files isn't very many, particularly if you're looking at lsof which double-counts file descriptors as we've seen above. Let's see the output from GET _cluster/health as I asked above.
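
Elasticsearch also reports its own descriptor usage through the nodes stats API, which avoids the double-counting entirely (a sketch; host and port are assumptions):

# Reports, per node, the process stats including open_file_descriptors and max_file_descriptors.
curl -s 'localhost:9200/_nodes/stats/process?filter_path=nodes.*.process&pretty'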

Here is the output for my local cluster:

{
  "cluster_name" : "CLUSTER1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 12,
  "active_shards" : 12,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 11,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 52.17391304347826
}

Hmm, ok, I was expecting more than 12 shards!

When Elasticsearch goes OOM it writes a heap dump. I think you'll need to open one of these (e.g. using MAT) to see what's consuming all the heap.
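
The dump location depends on the JVM options; a sketch for finding it (the config path assumes a package install, and both commands are assumptions about your layout):

# The heap dump path, if set explicitly, comes from -XX:HeapDumpPath in jvm.options.
grep -i heapdump /etc/elasticsearch/jvm.options

# Otherwise look for recently written .hprof files.
sudo find / -name '*.hprof' -mtime -7 2>/dev/null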

I will do that, but I'm troubled by the fact that Elasticsearch holds on to a lot of deleted files, and the space those files occupy doesn't get released until Elasticsearch is shut down.
Is there a way to combat this?

You haven't shown any evidence supporting this claim yet. So far you've shown Elasticsearch holding onto a single deleted file (repeated many times because that's how lsof works). From the filename, it looks like something other than Elasticsearch deleted it, so it's expected that it would remain open (because that's how Unix filesystems work). We also haven't seen anything demonstrating that this is related to the OOMs that you're reporting. I think you are jumping to conclusions at this point.
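
As a small demonstration of that behaviour (a sketch; /tmp is just an example mount point):

exec 3> /tmp/held-open.dat                          # open a descriptor to a new file
dd if=/dev/zero bs=1M count=100 >&3 2>/dev/null     # write 100 MB through it
rm /tmp/held-open.dat                               # "delete" the file
df -h /tmp                                          # the 100 MB is still counted as used
exec 3>&-                                           # close the descriptor
df -h /tmp                                          # now the space is released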

I just got this message in my elasticsearch log:

There is insufficient memory for the Java Runtime Environment to continue.
Cannot create GC thread. Out of system resources.
An error report file with more information is saved as:
/home/---/hs_err_pid31784.log
Error occurred during initialization of VM
java.lang.OutOfMemoryError: unable to create new native thread
