Missing event - file input plugin / NFS

Hi,

We have a problem with our Logstash installation and I'm a little stuck.

Initially, our Logstash install was a four-node v2.4 setup. We recently spent a few days migrating to 5.6.4, hoping the problem would disappear, but it persists.
We see the problem with a file input, but I can't tell whether it lies in this plugin or in the Logstash core.

We have several inputs like this one:

file {
  path => ["/mnt/logs_webcare_op1/webcare.log*"]
  exclude => "*.gz"
  type => "webcare_op"
  add_field => { "@host" => "op_webcare_as1" }
  start_position => "beginning"
  sincedb_path => "/etc/logstash/states/webcare_op1.sincedb"
  ignore_older => 172800
}

/mnt/logs_webcare_op1 is an NFS mount.
The logs rotate daily; no rotation is in progress when the anomalies occur.
There is only one pipeline worker.
sincedb files are not shared between the inputs.
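For completeness, the single worker is set in logstash.yml (just a sketch; only the worker count matters here):

# logstash.yml
pipeline.workers: 1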

A typical sample of the file:

2017-11-15 11:55:25,312 INFO [EnablerLogger:144] 743f669b-05d4-4dd2-adb1-37271cb79292|cae3c822-d765-477b-b4ce-4f3f6dcc441c|743f669b-05d4-4dd2-adb1-37271cb79292|API_REST|login|15/11/2017 11:55:25.0312|||PENDING|
...
2017-11-15 11:55:26,360 INFO [EnablerLogger:144] 743f669b-05d4-4dd2-adb1-37271cb79292|cae3c822-d765-477b-b4ce-4f3f6dcc441c|743f669b-05d4-4dd2-adb1-37271cb79292|API_REST|login|15/11/2017 11:55:25.0312|15/11/2017 11:55:26.0360|1048|SUCCESS|login=null

The problem is that we are missing some events.
It's a very small percentage (roughly 20 out of 1,000,000 events).

In the sample above, for example, we received the first line but not the second.
If I re-inject the sample into Logstash it is processed correctly, so I don't think the problem is in the filters.
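One way to replay such a sample (a sketch, not our production config: the filters stay the same, only the input and output are swapped for stdin/stdout):

input {
  # paste the sample lines on stdin, tagged like the production file input
  stdin {
    type => "webcare_op"
    add_field => { "@host" => "op_webcare_as1" }
  }
}
output {
  # print the filtered events for inspection
  stdout { codec => rubydebug }
}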

The events are filtered and published to a RabbitMQ queue:

if [@connector] == "webcare" and [@context] == "" {
  rabbitmq {
    key => "default"
    exchange => "*******"
    exchange_type => "direct"
    user => "********"
    password => "********"
    host => "********"
    port => 5672
    durable => true
    persistent => true
    vhost => "/"
  }
}

I've added a second output to track the bug and to make sure the problem is not in RabbitMQ or later:

output {
  if [@connector] == "webcare" {
    file {
      path => "/var/log/logstash/debug-webcare-%{+YYYY-MM-dd}.log"
      codec => line {
        format => "%{@timestamp}, id %{@id}, reference %{uuid}, path %{path}"
      }
    }
  }
}

In this debug-webcare file I can see the first event but not the second:

2017-11-15T15:55:25.312Z, id 86c8cb11-51f3-4797-b358-38d57b0ec07a, reference cae3c822-d765-477b-b4ce-4f3f6dcc441c, path /mnt/logs_webcare_op2/webcare.log

So I have to dig deeper, but does anyone have a direction or some advice? (I'm not a Ruby guy, but I will learn if necessary.)
How would you deal with this kind of problem?

Franck

My first thoughts are:

How do you arrive at the 20 out of 1,000,000 number? By using some count function, or have you identified the actual 20 missing events?

Are you able to accurately reproduce the problem using the debug-webcare file technique with a small subset of lines?

Logstash has traditionally kept a number of in-flight events in memory, which could lead to data loss if you restart Logstash. In version 5.6.4 you can, however, enable persistent queues, which will prevent data loss if the Logstash process crashes or restarts. I would recommend enabling this feature and seeing if that changes things.
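A minimal sketch of the relevant logstash.yml settings (the size and path below are placeholders, adjust them to your volume):

# logstash.yml - buffer in-flight events on disk instead of in memory
queue.type: persisted
queue.max_bytes: 1gb
path.queue: /var/lib/logstash/queue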

Well, I identify the ~20 events because they cause anomalies in our target platform, which is ES-based.
Then I look in the debug-webcare file to see whether these events were sent to RabbitMQ.

I'm not able to reproduce the problem with a small subset; it's highly random.

Logstash has traditionally kept a number of events in memory, which could lead to data loss if you restart

Regarding the example in my post, it happened at 15:55; the previous reboot was at 15:22. There is nothing in the Logstash log around 15:5*.

But I will try the persistent queues.

My next tries include:

  • enabling debug logging in Logstash to try to see where the missing events disappear (input/filter/output). I need to deal with the log volume for that (see the sketch after this list).

  • trying to isolate this "webcare" Logstash configuration. Its filters are straightforward (grok/csv/mutate), but other configurations on the node have more complex, stateful filters such as aggregate.
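For the log volume, one option is to raise only the file input logger to debug in config/log4j2.properties (a sketch: logger names follow the plugin class, e.g. logstash.inputs.file, and the "fileinput" id is arbitrary):

# config/log4j2.properties
logger.fileinput.name = logstash.inputs.file
logger.fileinput.level = debug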

Are you able to run all the original data through a test system and see the same results each time?

Are you able to identify the missing documents by seeing a gap in a sequence of identifiers or similar?

Hi Guy,

No, the problem is hard to reproduce; I have to wait hours to catch an anomaly.
The gap is not a single line in the original log file.
In the example, the missing event's timestamp was 11:55:26,360.

It is part of a missing run of 11 lines, from 11:55:25.837 to 11:55:27.675:

2017-11-15T15:55:25.837Z, id a1c06574-218c-461a-a452-52bc28f5a4cc, reference e160fa6b-51a7-47e0-9935-06cf8a7ab479, path /mnt/logs_webcare_op2/webcare.log
2017-11-15T15:55:27.675Z, id 229c9b51-8906-4181-8ad5-af852f82e03b, reference 2fc4512a-99b4-49e6-bff5-0c59b8a6b69a, path /mnt/logs_webcare_op2/webcare.log

Are you able to set up an alternative system that uses Filebeat to read the files, with the beats input instead?
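A minimal sketch of such a setup (paths and port are placeholders, assuming Filebeat 5.x syntax):

# filebeat.yml
filebeat.prospectors:
  - input_type: log
    paths:
      - /mnt/logs_webcare_op1/webcare.log*
    exclude_files: ['\.gz$']
output.logstash:
  hosts: ["localhost:5044"]

# Logstash side
input {
  beats {
    port => 5044
  }
}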

I can make this setup locally, on the NFS mount.
It should help localize the problem.
At the moment I'm running the setup at debug level (omg, it's verbose) and waiting for an anomaly. It can take all day.
If it's not conclusive I will try the Beats setup.

I suggest Filebeat because it is the recommended tech nowadays. The LS file input has not had much attention in a long time. Both LS and Filebeat are OK for reading an NFS mount, but there are some things to be aware of.

Also, Jordan wrote this in an internal memo:

My still-unclear understanding of this problem is as follows: reading and writing the same file from two different NFS clients is not supported by NFS itself.

Can you put Filebeat on the remote machine?

I've caught an anomaly.
Regarding these 2 lines (I removed some info):

2017-11-16 11:39:35,573 INFO [EnablerLogger:118] 05f7790f-a055-4e3d-bd75-44b82a30105b|05f7790f-a055-4e3d-bd75-44b82a30105b||WEBCARE|LOG_PCM|16/11/2017 11:39:35.0573|||PENDING|

2017-11-16 12:18:18,912 INFO [EnablerLogger:118] NA|05f7790f-a055-4e3d-bd75-44b82a30105b||WEBCARE|LOG_PCM|16/11/2017 11:39:35.0573|16/11/2017 12:18:18.0912|2323339|CANCEL|

The first has been lost; the second is OK.

I can find a trace of the second in the Logstash input debug logs:

logstash-plain-2017-11-16-14.log:[2017-11-16T12:19:08,684][DEBUG][logstash.inputs.file ] Received line {:path=>"/mnt/logs_webcare_op1/webcare.log", :text=>"2017-11-16 12:18:18,912 INFO [EnablerLogger:118] NA|05f7790f-a055-4e3d-bd75-44b82a30105b||WEBCARE|LOG_PCM|16/11/2017 11:39:35.0573|16/11/2017 12:18:18.0912|2323339|CANCEL|

Then in the output:

logstash-plain-2017-11-16-14.log:[2017-11-16T12:19:08,729][DEBUG][logstash.pipeline ] output received {"event"=>{"role"=>"WEBCARE", "@source"=>"2017-11-16 12:18:18,912 INFO [EnablerLogger:118] NA|05f7790f-a055-4e3d-bd75-44b82a30105b||WEBCARE|LOG_PCM|16/11/2017 11:39:35.0573|16/11/2017 12:18:18.0912|2323339|CANCEL...

But there is no trace at all of the first event...

But if I dig a little deeper, I can identify a line whose timestamp should correspond to the loss:

[2017-11-16T11:40:22,059][DEBUG][logstash.inputs.file ] Received line {:path=>"/mnt/logs_webcare_op1/webcare.log", :text=>"\x00\x00\x00\x00\00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x.......MANYMANYMANY.......0\x00\x00\x0035.0681|||PENDING"}

So could it be a weird character-encoding problem?
Any ideas?

Regarding Beats: it's not easy to deploy Filebeat on the remote side. These are customer servers, there are firewall issues, etc.
So the only thing I can try for the moment is a local install.

That's not the case for the 2017-11-16T11:40:22,059 log line, but I can find many similar patterns just after a "file grew" line:

[2017-11-16T11:30:05,342][DEBUG][logstash.inputs.file ] each: file grew: /mnt/logs_webcare_op1/webcare.log: old size 88153138, new size 88172605
[2017-11-16T11:30:05,348][DEBUG][logstash.inputs.file ] Received line {:path=>"/mnt/logs_webcare_op1/webcare.log", :text=>"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00.......\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x002017-11-16 11:29:16,714 INFO [EnablerLogger:135] 075fc7ba-529c-424f-842f-d1834f861f72|adc67770-c257-4f37-a101-5e10ffac9eee|075fc7ba-529c-424f-842f-d1834f861f72|IHM_WEBCARE|WebcareManageCustomerLocationServiceImpl.getSupportedCountries|16/11/2017 11:29:16.0714|||PENDING|"}

So..

As the OS is CentOS 6 on both sides, I think I'm running into this problem: https://access.redhat.com/solutions/22717
(I don't have a Red Hat account, so I can't see the full ticket.)

The problem is known here: Logstash vs NFS (null characters)

This post is a good summary: https://serverfault.com/questions/707034/disable-file-cache-for-mount-point
Now I have to find a workaround; a local Filebeat would probably suffer in the same way.
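The direction I'm looking at is reducing client-side caching on the mount. A sketch of the fstab entry (the server path is a placeholder, and I have not verified that this actually prevents the NULL reads):

# /etc/fstab - noac disables attribute caching, lookupcache=none disables dentry caching
nfsserver:/exports/logs_webcare_op1  /mnt/logs_webcare_op1  nfs  ro,noac,lookupcache=none  0 0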

Franck

Here is my workaround, currently running in production; it seems OK.

In watched_file.rb:

15c15
< attr_accessor :close_older, :ignore_older, :delimiter
---
> attr_accessor :close_older, :ignore_older, :delimiter, :strip_null_bytes
28a29
> @strip_null_bytes = false

In observing_tail.rb:

78a79,91
>       if line.include? "\000"
>         if watched_file.strip_null_bytes
>           @logger.debug? && @logger.debug("removing null bytes from #{watched_file.path} : #{line}")
>           line = line.delete("\000")
>         else
>           @logger.debug? && @logger.debug("null bytes detected in #{watched_file.path}. Reloading file : #{line}")
>           watched_file.strip_null_bytes = true
>           watched_file.file_close()
>           sleep(1)
>           break
>         end
>       end
>       watched_file.strip_null_bytes = false

I'm not the only one facing this problem; over the last few years you can find several forum posts related to it.

Jordan, if you read this: don't you think this workaround (or a variant) deserves to be integrated into master?

If you want, I can package it better, activate it via a file input plugin parameter, and then make a PR (I won't if you think it isn't worth it).

Franck

What stops you from putting a Filebeat where the events are generated (far end machine)?

Using filebeat on the edge machines is what we are recommending nowadays.

I think I will go in this direction in next future but remote servers are not in my hands.
The actions after the question "hey can you deploy this RPM on frontends, with this config file, and open 5044 port in firewalls to logstash ?" can take weeks.

Anyways I know usecases where the data on the remote servers are themselves on NFS mounts,
which origin servers are some exotic systems (eg old hpux or solaris) where a filebeat deployment wouldn't be easy. So have a good support of NFS either in file input plugin or filebeat is not a bad idea


Let me check with the Beats team whether they already have code similar to your filewatch patch in place.

Hi,

In Filebeat we discourage the use of network shares. Some common problems with network shares:

  • the inode (or Windows volume + file ID) potentially changing (especially with old CIFS implementations shared between multiple OS types).
  • file unavailability due to network problems -> the file state might be GC'ed if the outage lasts too long -> the file might be sent again.
  • file contents being updated out of order at the disk level.

Out-of-order updates to the underlying file (or at the network level) can lead to the OS inserting NULLs for the missing pages until all the content is available. This happens because the file size suddenly increases by N pages without the contents of those N pages/blocks being available yet. Any tool watching/following the file will be able to read the missing area, as the total file size did increase by N+1 blocks, but it will only get to read NULLs.

There is no support for backing off on NULLs in Filebeat yet. Potential solutions to the problem (once a NULL is detected) are:

  • just close the file and hope it is picked up again on the next scan.
  • use seek to return to the last known-good position in the file and back off before trying to read again.

Normally file inconsistencies don't persist for too long. Personally I'd prefer the seek-and-retry approach.
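A rough sketch of the seek-and-retry idea, written in Ruby to match the filewatch patch above (this is not Filebeat code; the reader and state layout are made up for illustration):

# On a NULL read, rewind to the last clean offset and back off before retrying.
def read_clean_lines(io, state)
  io.seek(state[:offset])
  while (line = io.gets)
    if line.include?("\0")
      io.seek(state[:offset])  # return to the last known-good position
      sleep(state[:backoff])   # give the NFS client time to fetch the real pages
      return
    end
    state[:offset] = io.pos    # only advance past lines that are clean
    yield line
  end
end

# usage sketch:
# state = { offset: 0, backoff: 1 }
# File.open("/mnt/logs_webcare_op1/webcare.log") { |f| read_clean_lines(f, state) { |l| puts l } }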


I totally understand that you discourage it, for all the good reasons stated.
But sometimes you just don't have the choice, so the ES stack should simply be as good as possible in that situation.

inode (or windows volume + file ID) potentially changing (especially old CIFS implementations sharing between multiple OS types).

Yep, I got something like that on NFS (the device major number changing; there are workarounds). It's bad.