Logstash 6.5.0 stops shipping logs to ES

We are running into an issue where the Logstash service is running but the logs are not being shipped to ES.

This issue is quite strange as it is random, we would start the logstash instance and after a few hours or sometimes days, Logstash would stop processing the logs and shipping them to ES.

Errors in Logstash.log (We have enabled debug)

[2019-04-04T15:13:16,993][DEBUG][logstash.instrument.periodicpoller.cgroup] One or more required cgroup files or directories not found: /proc/self/cgroup, /sys/fs/cgroup/cpuacct, /sys/fs/cgroup/cpu
[2019-04-04T15:13:17,196][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2019-04-04T15:13:17,196][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}

This all we are seeing in the logs at this point. Has anyone seen this issue?

Anyone has any idea on why this is happening? Any help would be greatly appreciated. We are not sure if this is the root cause of the reason why logs are not being shipped to ES but that’s the only indicator that we see if the logs write now.

I must also add that Logstash stopped processing events from the input file, it appears as if Logstash is stuck.

These DEBUG messages are just that. The cgroup related one goes away when LS is run in docker.
The others are saying which Java JVM Garbage Collector metrics are being collected.

What is your configuration?

@guyboertje, thanks for the reply much appreciated. Here are the details of my config.

Note: We are able to ship the logs with no issues for 2-3 days and suddenly the logs would stop getting shipped to ES and we have no way of determining what's causing the issue based on the logs.

Logstash.conf

    input {
	file {
		path => ["/pathto/log/mylog.log"]
		start_position => beginning
		sincedb_path => "/dev/null"
	}
}
filter {
	grok {
		match => { "message" => "\[%{TIMESTAMP_ISO8601}\]\[%{LOGLEVEL:loglevel}\]\[com.logging.LoggingServletFilter]\[%{DATA:mdc}\]\[%{DATA:thread}\]\[%{GREEDYDATA:exception}\]\[httptrace>(?<http>[^\t]+)\t\]" }
		add_field => { "logtype" => "httptrace" }
		add_field => { "class" => "com.logging.LoggingServletFilter" }
		remove_field => [ "message" ]
	}
	
	grok {
		match => { "message" => "\[%{TIMESTAMP_ISO8601}\]\[%{LOGLEVEL:loglevel}\]\[%{DATA:class}\]\[%{DATA:mdc}\]\[%{DATA:thread}\]\[%{GREEDYDATA:exception}\]\[%{GREEDYDATA:msg}\t\]" }
		add_field => { "logtype" => "codetrace" }
		remove_field => [ "message" ]
	}
	
	if [logtype] == "httptrace" {
		kv {
			source => "http"
			field_split => "|"
			target => "http_trace"
			value_split_pattern => ":\="
		}
		mutate {
			remove_field => [ "http" ]
			replace => { "msg" => "HTTP Trace" }
		}
	} 

	if [logtype] != "httptrace" and [logtype] != "codetrace" {
		drop { }
	}
	
	date {
		match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss,SSS", "ISO8601" ]
	}
}

output {
    amazon_es {
        hosts => ["endpoint"]
        region => "us-west-1"
        index => "env-%{+YYYY.MM.dd}"
		aws_access_key_id => "1111111111111111111111"
		aws_secret_access_key => "1111111111111111111111111"
    }
}

The file input has a lot more logging at the TRACE level but before you go and set the level to TRACE for all Logstash you should rather set levels dynamically for individual components at will. Here is a quote from an internal post that I made a while back. Please read it but don't do exactly what is written in this snippet (instructions for you after)...


The logging API allows for different levels of logging for different components in LS.

First do curl -XGET 'localhost:9600/_node/logging?pretty'
You see something like this:

{
  "host" : "Elastics-MacBook-Pro.local",
  "version" : "6.4.0",
  "http_address" : "127.0.0.1:9600",
  "id" : "8789409b-7126-4034-9347-de47e6ce12a9",
  "name" : "Elastics-MacBook-Pro.local",
  "loggers" : {
    "filewatch.discoverer" : "INFO",
    "filewatch.observingtail" : "INFO",
    "filewatch.sincedbcollection" : "INFO",
    "filewatch.tailmode.handlers.createinitial" : "INFO",
    "filewatch.tailmode.processor" : "INFO",
    "logstash.agent" : "INFO",
    "logstash.api.service" : "INFO",
    "logstash.codecs.json" : "INFO",
    ...
    "logstash.filters.grok" : "INFO",
    "logstash.filters.date" : "INFO",
    "logstash.inputs.file" : "INFO",
    ...
    "logstash.outputs.stdout" : "INFO",
    "logstash.pipeline" : "INFO",
    ...
    "slowlog.logstash.codecs.json" : "INFO",
    "slowlog.logstash.codecs.rubydebug" : "INFO",
    "slowlog.logstash.filters.date" : "INFO",
    "slowlog.logstash.inputs.file" : "INFO",
    "slowlog.logstash.outputs.stdout" : "INFO"
  }
}

Using the API
Turn DEBUG on for just the date and grok filters:

curl -XPUT 'localhost:9600/_node/logging?pretty' -H 'Content-Type: application/json' -d'
{
    "logstash.filters.date" : "DEBUG"
    "logstash.filters.grok" : "DEBUG"
}
'

Turn trace off:

curl -XPUT 'localhost:9600/_node/logging?pretty' -H 'Content-Type: application/json' -d'
{
    "logstash.filters.date" : "WARN"
    "logstash.filters.grok" : "WARN"
}
'

Or

curl -XPUT 'localhost:9600/_node/logging/reset?pretty'

NOTE: it might be a good idea to start LS with logging set to WARN in the logstash.yml so other logging is less verbose.


When you do the curl -XGET 'localhost:9600/_node/logging?pretty', you will see some components that start with filewatch - you should turn on TRACE logging for those only and only when you notice the file input failing to read new lines (before you restart Logstash).

Post the log lines at TRACE level here, I should be able to see what is going wrong from these. Note: if you turn on TRACE globally there will be too much logging and that makes the fault interpretation much much harder.

1 Like

Hi @guyboertje,
As suggested here is what I did...

Enabled Trace on on a few loggers.
curl -XPUT 'http://myserver.com:9600/_node/logging?pretty' -H 'Content-Type: application/json' -d'
{
"logger.filewatch.discoverer": "TRACE",
"logger.filewatch.observingtail": "TRACE",
"logger.filewatch.sincedbcollection": "TRACE",
"logger.filewatch.tailmode.handlers.createinitial": "TRACE",
"logger.filewatch.tailmode.handlers.grow": "TRACE",
"logger.filewatch.tailmode.handlers.timeout": "TRACE",
"logger.filewatch.tailmode.processor": "TRACE",
"logger.filewatch.tailmode.handlers.create": "TRACE",
"logger.filewatch.tailmode.handlers.shrink": "TRACE",
"logger.logstash.filters.date": "TRACE",
"logger.logstash.filters.drop": "TRACE",
"logger.logstash.filters.grok": "TRACE",
"logger.logstash.inputs.file": "TRACE"
}
'
Output from my logger API
{
"host": "myserver.com",
"version": "6.6.0",
"http_address": "mysever.com:9600",
"id": "44219357-5abf-4028-aa44-f955ded1d33b",
"name": "myserver.com",
"loggers": {
"filewatch.discoverer": "TRACE",
"filewatch.observingtail": "TRACE",
"filewatch.sincedbcollection": "TRACE",
"filewatch.tailmode.handlers.create": "TRACE",
"filewatch.tailmode.handlers.createinitial": "TRACE",
"filewatch.tailmode.handlers.grow": "TRACE",
"filewatch.tailmode.handlers.shrink": "TRACE",
"filewatch.tailmode.processor": "TRACE",
"logstash.agent": "DEBUG",
"logstash.api.service": "DEBUG",
"logstash.codecs.plain": "DEBUG",
"logstash.config.source.local.configpathloader": "DEBUG",
"logstash.config.source.multilocal": "DEBUG",
"logstash.config.sourceloader": "DEBUG",
"logstash.configmanagement.extension": "DEBUG",
"logstash.filters.date": "TRACE",
"logstash.filters.drop": "TRACE",
"logstash.filters.grok": "TRACE",
"logstash.filters.kv": "DEBUG",
"logstash.filters.mutate": "DEBUG",
"logstash.inputs.file": "TRACE",
"logstash.instrument.periodicpoller.cgroup": "DEBUG",
"logstash.instrument.periodicpoller.deadletterqueue": "DEBUG",
"logstash.instrument.periodicpoller.jvm": "DEBUG",
"logstash.instrument.periodicpoller.os": "DEBUG",
"logstash.instrument.periodicpoller.persistentqueue": "DEBUG",
"logstash.modules.scaffold": "DEBUG",
"logstash.modules.xpackscaffold": "DEBUG",
"logstash.monitoringextension": "DEBUG",
"logstash.monitoringextension.pipelineregisterhook": "DEBUG",
"logstash.outputs.elasticsearch": "DEBUG",
"logstash.pipeline": "DEBUG",
"logstash.plugins.registry": "DEBUG",
"logstash.runner": "DEBUG",
"logstash.setting.writabledirectory": "DEBUG",
"logstash.util.decorators": "DEBUG",
"org.logstash.Event": "DEBUG",
"org.logstash.FieldReference": "DEBUG",
"org.logstash.Logstash": "DEBUG",
"org.logstash.execution.AbstractPipelineExt": "DEBUG",
"org.logstash.execution.ShutdownWatcherExt": "DEBUG",
"org.logstash.execution.WorkerLoop": "DEBUG",
"org.logstash.filters.DateFilter": "DEBUG",
"org.logstash.instrument.metrics.gauge.LazyDelegatingGauge": "DEBUG",
"org.logstash.instrument.monitors.HotThreadsMonitor": "DEBUG",
"org.logstash.plugins.pipeline.PipelineBus": "DEBUG",
"org.logstash.secret.store.SecretStoreFactory": "DEBUG",
"slowlog.logstash.codecs.plain": "TRACE",
"slowlog.logstash.filters.date": "TRACE",
"slowlog.logstash.filters.drop": "TRACE",
"slowlog.logstash.filters.grok": "TRACE",
"slowlog.logstash.filters.kv": "TRACE",
"slowlog.logstash.filters.mutate": "TRACE",
"slowlog.logstash.inputs.file": "TRACE",
"slowlog.logstash.outputs.elasticsearch": "TRACE"
}
}

Logs are not showing much the same messages are still being shown.
[2019-04-10T16:00:07,690][DEBUG][logstash.api.service ] [api-service] start
[2019-04-10T16:00:07,690][DEBUG][logstash.api.service ] [api-service] start
[2019-04-10T16:00:07,691][DEBUG][logstash.api.service ] [api-service] start
[2019-04-10T16:00:07,691][DEBUG][logstash.api.service ] [api-service] start
[2019-04-10T16:00:07,691][DEBUG][logstash.api.service ] [api-service] start
[2019-04-10T16:00:07,747][DEBUG][logstash.agent ] API HTTP Request {:status=>200, :request_method=>"PUT", :path_info=>"/_node/logging", :query_string=>"pretty", :http_version=>"HTTP/1.1", :http_accept=>"/"}
[2019-04-10T16:00:10,689][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2019-04-10T16:00:10,689][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2019-04-10T16:00:11,201][DEBUG][logstash.instrument.periodicpoller.cgroup] One or more required cgroup files or directories not found: /proc/self/cgroup, /sys/fs/cgroup/cpuacct, /sys/fs/cgroup/cpu
[2019-04-10T16:00:15,804][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2019-04-10T16:00:15,804][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2019-04-10T16:00:16,209][DEBUG][logstash.instrument.periodicpoller.cgroup] One or more required cgroup files or directories not found: /proc/self/cgroup, /sys/fs/cgroup/cpuacct, /sys/fs/cgroup/cpu
[2019-04-10T16:00:20,941][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2019-04-10T16:00:20,941][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2019-04-10T16:00:21,214][DEBUG][logstash.instrument.periodicpoller.cgroup] One or more required cgroup files or directories not found: /proc/self/cgroup, /sys/fs/cgroup/cpuacct, /sys/fs/cgroup/cpu
[2019-04-10T16:00:26,061][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2019-04-10T16:00:26,097][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2019-04-10T16:00:26,315][DEBUG][logstash.instrument.periodicpoller.cgroup] One or more required cgroup files or directories not found: /proc/self/cgroup,

Please let me know if you see anything that stands out. Once again thanks for taking the time to look into this issue.

@guyboertje, Kindly requesting.... did you get a chance to look at the logs that I provided. This seems to be a bug.... Even with Trace enabled we are not able to get much information.

Thanks for your time.

Bumping this thread... anyone able to provide some assistance?

What does your logstash.yml file look like?

Settings file in YAML

#
# Settings can be specified either in hierarchical form, e.g.:
#
#   pipeline:
#     batch:
#       size: 125
#       delay: 5
#
# Or as flat keys:
#
#   pipeline.batch.size: 125
#   pipeline.batch.delay: 5
#
# ------------  Node identity ------------
#
# Use a descriptive name for the node:
#
# node.name: test
#
# If omitted the node name will default to the machine's host name
#
# ------------ Data path ------------------
#
# Which directory should be used by logstash and its plugins
# for any persistent needs. Defaults to LOGSTASH_HOME/data
#
# path.data:
#
# ------------ Pipeline Settings --------------
#
# The ID of the pipeline.
#
# pipeline.id: main
#
# Set the number of workers that will, in parallel, execute the filters+outputs
# stage of the pipeline.
#
# This defaults to the number of the host's CPU cores.
#
# pipeline.workers: 2
#
# How many events to retrieve from inputs before sending to filters+workers
#
# pipeline.batch.size: 125
#
# How long to wait in milliseconds while polling for the next event
# before dispatching an undersized batch to filters+outputs
#
# pipeline.batch.delay: 50
#
# Force Logstash to exit during shutdown even if there are still inflight
# events in memory. By default, logstash will refuse to quit until all
# received events have been pushed to the outputs.
#
# WARNING: enabling this can lead to data loss during shutdown
#
# pipeline.unsafe_shutdown: false
#
# ------------ Pipeline Configuration Settings --------------
#
# Where to fetch the pipeline configuration for the main pipeline
#
# path.config:
#
# Pipeline configuration string for the main pipeline
#
# config.string:
#
# At startup, test if the configuration is valid and exit (dry run)
#
# config.test_and_exit: false
#
# Periodically check if the configuration has changed and reload the pipeline
# This can also be triggered manually through the SIGHUP signal
#
# config.reload.automatic: false
#
# How often to check if the pipeline configuration has changed (in seconds)
#
# config.reload.interval: 3s
#
# Show fully compiled configuration as debug log message
# NOTE: --log.level must be 'debug'
#
# config.debug: false
#
# When enabled, process escaped characters such as \n and \" in strings in the
# pipeline configuration files.
#
# config.support_escapes: false
#
# ------------ Module Settings ---------------
# Define modules here.  Modules definitions must be defined as an array.
# The simple way to see this is to prepend each `name` with a `-`, and keep
# all associated variables under the `name` they are associated with, and
# above the next, like this:
#
# modules:
#   - name: MODULE_NAME
#     var.PLUGINTYPE1.PLUGINNAME1.KEY1: VALUE
#     var.PLUGINTYPE1.PLUGINNAME1.KEY2: VALUE
#     var.PLUGINTYPE2.PLUGINNAME1.KEY1: VALUE
#     var.PLUGINTYPE3.PLUGINNAME3.KEY1: VALUE
#
# Module variable names must be in the format of
#
# var.PLUGIN_TYPE.PLUGIN_NAME.KEY
#
# modules:
#
# ------------ Cloud Settings ---------------
# Define Elastic Cloud settings here.
# Format of cloud.id is a base64 value e.g. dXMtZWFzdC0xLmF3cy5mb3VuZC5pbyRub3RhcmVhbCRpZGVudGlmaWVy
# and it may have an label prefix e.g. staging:dXMtZ...
# This will overwrite 'var.elasticsearch.hosts' and 'var.kibana.host'
# cloud.id: <identifier>
#
# Format of cloud.auth is: <user>:<pass>
# This is optional
# If supplied this will overwrite 'var.elasticsearch.username' and 'var.elasticsearch.password'
# If supplied this will overwrite 'var.kibana.username' and 'var.kibana.password'
# cloud.auth: elastic:<password>
#
# ------------ Queuing Settings --------------
#
# Internal queuing model, "memory" for legacy in-memory based queuing and
# "persisted" for disk-based acked queueing. Defaults is memory
#
# queue.type: memory
#
# If using queue.type: persisted, the directory path where the data files will be stored.
# Default is path.data/queue
#
# path.queue:
#
# If using queue.type: persisted, the page data files size. The queue data consists of
# append-only data files separated into pages. Default is 64mb
#
# queue.page_capacity: 64mb
#
# If using queue.type: persisted, the maximum number of unread events in the queue.
# Default is 0 (unlimited)
#
# queue.max_events: 0
#
# If using queue.type: persisted, the total capacity of the queue in number of bytes.
# If you would like more unacked events to be buffered in Logstash, you can increase the
# capacity using this setting. Please make sure your disk drive has capacity greater than
# the size specified here. If both max_bytes and max_events are specified, Logstash will pick
# whichever criteria is reached first
# Default is 1024mb or 1gb
#
# queue.max_bytes: 1024mb
#
# If using queue.type: persisted, the maximum number of acked events before forcing a checkpoint
# Default is 1024, 0 for unlimited
#
# queue.checkpoint.acks: 1024
#
# If using queue.type: persisted, the maximum number of written events before forcing a checkpoint
# Default is 1024, 0 for unlimited
#
# queue.checkpoint.writes: 1024
#
# If using queue.type: persisted, the interval in milliseconds when a checkpoint is forced on the head page
# Default is 1000, 0 for no periodic checkpoint.
#
# queue.checkpoint.interval: 1000
#
# ------------ Dead-Letter Queue Settings --------------
# Flag to turn on dead-letter queue.
#
# dead_letter_queue.enable: false

# If using dead_letter_queue.enable: true, the maximum size of each dead letter queue. Entries
# will be dropped if they would increase the size of the dead letter queue beyond this setting.
# Default is 1024mb
# dead_letter_queue.max_bytes: 1024mb

# If using dead_letter_queue.enable: true, the directory path where the data files will be stored.
# Default is path.data/dead_letter_queue
#
# path.dead_letter_queue:
#
# ------------ Metrics Settings --------------
#
# Bind address for the metrics REST endpoint
#
 http.host: "myserver.com"
#
# Bind port for the metrics REST endpoint, this option also accept a range
# (9600-9700) and logstash will pick up the first available ports.
#
http.port: 9600

That looks OK. Is there anything in the Elasticsearch logs? Are you monitoring Elasticsearch? If so, is there anything in the stats that corresponds in time to Logstash stopping to send logs?

There is nothing that shows anything strange. We know that the configurations and the logstash.conf works as expected because we are able to process large amount of logs, but the concern is with the frequent stop of processing by logstash. Unfortunately the logs are not telling me much...

If you have not already I would add debug logging for the amazon_es output plugin to see if this is encountering any issues around connectivity or failing events that is causing a periodic problem.

@stash2logs

There are no TRACE lines in the logging that you posted. If done correctly there will be thousands of lines of TRACE level logging from the components whose names the begin with filewatch.

Logstash is a synchronous system and the file input is single threaded. Meaning that if the file input thread is not logging any lines it is stuck waiting for the queue to be emptied. If the workers are stuck trying to send data via amazon_es then they can't empty the queue.

Unfortunately there does not seem to be any logging from the output that gives us a clue to the cause.

@guyboertje, I’m not sure i understand the statement if done correctly. As pasted above we have enabled TRACE for all the for all the filewatch settings and it seems to me that the reason we are not seeing any trace is the logs could be a bug or there are no filewatch activities to be traced. Right now we have kept the system in the stuck state to be able to troubleshoot and have not restarted Logstash. Anything else that you can think of? Any other trace to enable?

I will enable trace on the amazon_es plug-ins. Would you happen to know how to do that? Would setting the output plugins to Trace suffice?

You rightly feel I said you did not do the procedure correctly. I apologise, it was not my intention.

The lack of TRACE level lines coming from filewatch.* can have one of two sources, 1) incorrect logging level setting procedure 2) a stuck input thread. I was not sure which one of these is at play here. If you say that you did them correctly then it must be a stuck input thread.

Unfortunately, not all Logstash plugins actually log messages at the TRACE level. I did so with the filewatch components because there was some very difficult file rotation handling to get correct and I needed a way to exactly see how components were reacting. I don't think the amazon_es output has any TRACE output and you have already shown the DEBUG level lines and we are not seeing warnings or errors coming from that component.

(Due to the asynchronous file rotation mechanisms that occur at any time by external (OS) actors, it is hard to simulate and test.)

No worries @guyboertje, no need to apologize. You mentioned the issue could be related to a stuck input thread, any idea of What could be the cause of a stuck input thread in the context of logstash?

One other thing that I wanted to point out is the fact that we have two servers server1 and server2 pushing logs to the same aws es instance. I noticed that both servers stop shipping logs with three minutes among them.
server 1 - 7:48
server 2 - 7:51

Logstash is a synchronous system and the file input is single threaded. Meaning that if the file input thread is not logging any lines it is stuck waiting for the queue to be emptied. If the workers are stuck trying to send data via amazon_es then they can't empty the queue.

The file input thread is in a loop
Read from file
Hand to codec and wait...
Codec either buffers the data and returns or creates an events and tries to put it in the queue if it can't then it waits for a "slot", when it gets one it hands off the event and returns.
returns from codec and loops to read more.

The worker threads are in a loop
Initially, they all wait at the queue to accept events (given by the codec) only one worker thread takes from the queue at a time until a "batch" is taken.
The next worker thread starts taking events.
The original worker executes the filters with the batch and moves on to the output.
If the worker thread can successfully output the batch on it then loops around to the queue.
If the worker thread can't output the batch then it retries (to avoid data loss) in a sleep/attempt loop. If all worker threads are in this sleep attempt loop there are no workers to accept events from the input(s) and the inputs are said to be stuck.

I don't what version, but the amazon_es code looks like being based on an older version of the logstash elasticsearch output - from around June 2018. There are a few bugs that have been fixed since then. IMO, Amazon should have sent us patches instead of maintaining an out-of-date fork.

I suggest you restart Logstash to allow the outputs to reconnect. The file input should start reading from the end of the last successful event message - i.e. re-read and resend all in-flight data.

Thanks for this thorough update. The issue here is the files get rotated, so I restart Logstash right now the input file would have already been rolled over. This seems to be a more complex issue than I thought....