Logstash pipeline fails to start with an existing persistent queue

Hi,

We've had some cases where something goes wrong: the Logstash pipeline crashes and attempts to restart, but is unable to start up again because it fails to create the persistent queue. When this happens, I have to go in and clear out the page and checkpoint files before the pipeline will start. We are running Logstash 5.6.3 via Docker with 4 pipeline workers, and our pipeline has two jdbc inputs and two elasticsearch outputs.
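For reference, our queue-related settings look roughly like this (a sketch from memory rather than our exact logstash.yml; everything besides the worker count and queue type is the 5.x default as I understand it):

# logstash.yml (sketch)
pipeline.workers: 4
queue.type: persisted
# queue.page_capacity defaults to 250mb in 5.x, which matches the ~262 MB page files below
# path.queue defaults to <path.data>/queue, i.e. /usr/share/logstash/data/queue in the official image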

This has happened a few times, and the logs look a bit different for each time. I'm just going to focus on the most recent case.

A listing of the queue directory when the pipeline was unable to start up:

ls -l
34 Nov 24 15:27 checkpoint.128
34 Nov 24 19:13 checkpoint.129
34 Nov 24 20:33 checkpoint.head
262144000 Nov 24 15:27 page.128
262144000 Nov 24 20:33 page.130

The checkpoint and head files were last modified at 20:33, which is when the pipeline crash happened. Data up until that time did make it into elasticsearch.
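For reference, the manual cleanup I mentioned above is roughly the following (a sketch; the queue path assumes the image's default data directory and the default "main" pipeline, so adjust it if path.queue is set differently):

# with the Logstash container stopped, move the corrupted queue aside
mv /usr/share/logstash/data/queue/main /usr/share/logstash/data/queue/main.corrupt
# start the container again; Logstash should recreate the queue directory on startup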

First, there was a stack trace:

IOError: java.io.IOException: computed checksum=-1147247917 != checksum for file=0
read at org/logstash/ackedqueue/io/AbstractByteBufferPageIO.java:241
readBatch at org/logstash/ackedqueue/Page.java:55
_readPageBatch at org/logstash/ackedqueue/Queue.java:531
readBatch at org/logstash/ackedqueue/Queue.java:522
read_batch at org/logstash/ackedqueue/ext/JrubyAckedQueueExtLibrary.java:163
read_batch at /usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:68
read_batch at /usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:68
read_next at /usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:260
read_next at /usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:260
read_batch at /usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:172
read_batch at /usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:172
read_batch at /usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:171
read_batch at /usr/share/logstash/logstash-core/lib/logstash/util/wrapped_acked_queue.rb:171
worker_loop at /usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:377
start_workers at /usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:342
run at java/lang/Thread.java:748

Logstash crashed, attempted a restart, and we got this error:

[2017-11-24T20:33:48,450][ERROR][logstash.pipeline ] Logstash failed to create queue exception=org.logstash.ackedqueue.io.AbstractByteBufferPageIO$PageIOInvalidElementException: Element seqNum 0 is expected to be 78089562, backtrace [org/logstash/ackedqueue/io/AbstractByteBufferPageIO.java:143:in `readNextElement', org/logstash/ackedqueue/io/AbstractByteBufferPageIO.java:81:in `open', org/logstash/ackedqueue/io/MmapPageIO.java:33:in `open', org/logstash/ackedqueue/Queue.java:206:in `open', org/logstash/ackedqueue/ext/JrubyAckedQueueExtLibrary.java:132:in `ruby_open', org/logstash/ackedqueue/ext/JrubyAckedQueueExtLibrary$RubyAckedQueue .......

There's more to the error, but it's really long. Let me know if I should post the rest of it!

Any help is greatly appreciated! Thanks!

You should upgrade to 6.1.1 or 5.6.5. Around 5.6 we fixed quite a few bugs in the PQ code.

Got it upgraded to 5.6.5!

There have been a couple of times since the upgrade where the same thing has happened: the Logstash pipeline restarts itself for some reason, but is unable to actually start up because it fails to create the PQ. Again, I manually intervened, deleted the existing PQ files, and the pipeline started up.

In these two recent cases, I haven't found log messages indicating why the pipeline crashed or tried to restart itself in the first place. The error message we got when it was unable to restart was this:

[2018-01-02T20:12:03,247][ERROR][logstash.pipeline ] Logstash failed to create queue
exception=java.io.EOFException: null, backtrace=
org/logstash/common/io/InputStreamStreamInput.java:19:in `readByte', org/logstash/common/io/BufferedChecksumStreamInput.java:39:in `readByte',
org/logstash/common/io/StreamInput.java:77:in `readShort', org/logstash/ackedqueue/io/FileCheckpointIO.java:104:in `read',
org/logstash/ackedqueue/io/FileCheckpointIO.java:53:in `read', org/logstash/ackedqueue/Queue.java:179:in `open',
org/logstash/ackedqueue/ext/JrubyAckedQueueExtLibrary.java:132:in `ruby_open'
...

The stack trace is again truncated.

I am going to get Colin involved. He has more recent experience with the PQ code and debugging it.

One more thing:
Are you using any unusual kind of disks?

Also, although the PQ is designed to survive an uncontrolled LS shutdown, we need to find out why your pipeline is shutting down like this.

/cc @colinsurprenant

We are currently storing the queue on a GlusterFS mount. It's pretty slow and could be the source of some problems.

I agree, I need to figure out why the pipeline is shutting down. Looking back at logs that I have for the past couple of weeks, I see that the pipeline had shut down and started up again (successfully) a few times per day.

I wonder whether we are trying to ingest some data that's messing things up... I don't know how to check that though, especially after the fact.

I'm guessing that you run LS as a service and the service manager does the restart.

What OS and OS version?

The crash messages may be getting written to the service manager logs or to dmesg, i.e. not to the LS logs.
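For example, something along these lines on the Docker host would show whether the kernel OOM killer or the container runtime is behind the restarts (a sketch only; <container> is a placeholder and the exact commands depend on your OS and how the container is managed):

dmesg -T | grep -i -E 'killed process|out of memory'
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}} {{.State.FinishedAt}}' <container>
journalctl -u docker --since yesterday | grep -i <container>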

Hi,

First, I want to say that we never tested the PQ on GlusterFS or any other network filesystem. The PQ is really meant to be used on a local filesystem where the write-durability and fsync semantics are well understood. The checksum errors you are reporting seem to indicate corruption, which would be consistent with the unknown write/durability behaviour of a network filesystem.

It seems we are dealing with two kinds of problems here, and I would suggest we eliminate some variables to pin down the source of the underlying crashes. For this I suggest we take the PQ out of the equation for now by using in-memory queuing while we perform these tests; if the PQ is mandatory, then I'd suggest you configure it to use a local filesystem.
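Concretely, the two options would look something like this in logstash.yml (illustrative only; the local path is a placeholder, use whatever local-disk directory suits your setup):

# option 1: take the PQ out of the equation while we test
queue.type: memory

# option 2: keep the PQ, but on a local filesystem
queue.type: persisted
path.queue: /var/lib/logstash/queue   # a path on local disk, not the GlusterFS mount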

Once this is done, I suggest we monitor LS behaviour to see if it crashes again, and we can take it from there. Does that make sense?

Also, would you let us know what environment you are running LS in? OS and version, virtualized or not, and Java version?

Colin

We are running LS in a Docker container (based on the image docker.elastic.co/logstash/logstash:5.6.5). The container runs on a container management platform (Rancher, specifically), which we have configured to restart the container if it stops. We have used both Docker 1.12 and 17.09, on CentOS 7.

We are planning to restructure our LS pipelines in the next few days, in the course of which we will switch to using a local filesystem for the PQ.
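The rough idea is to bind-mount a local host directory over the image's default queue location, something like this (a sketch; the host path is a placeholder and we haven't finalized the details):

# /data/logstash/queue is a placeholder -- any directory on local disk works;
# the container path matches the default path.queue inside the official image
docker run -d \
  -v /data/logstash/queue:/usr/share/logstash/data/queue \
  docker.elastic.co/logstash/logstash:5.6.5
# Rancher handles restarting the container for us, so no --restart flag here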

I'll update if there are any other crashes after we make the switch!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.