Grokparsefailure occurs randomly

Hello,

I have a strange problem with my ELK setup. I'm using this Docker container https://hub.docker.com/r/sebp/elk/ with the current version (651),
so I'm running Logstash 6.5.1.

I wrote a small script which copies the logfiles to the ELK host over SSH every hour. After a log is completely imported I move the logfile to an archive.

The logfiles have the following structure:

"a93f1bd7-32fc-43a2-bf7e-929c1e539319";"2019-01-24T00:59:56+01:00";"perf;firstpaint;273";""
"a93f1bd7-32fc-43a2-bf7e-929c1e539319";"2019-01-24T00:59:56+01:00";"perf;jsinitialized;648";""
"1434c999-a818-458d-c2d1-6f3db7e1c19c";"2019-01-24T00:59:56+01:00";"productSliders;lastViewed;showProduct";""
"75e508ae-ac56-4a5b-f51e-caee70444f68";"2019-01-24T00:59:56+01:00";"abtest;PGAJA00;init";""
"957f5d1c-a927-43bd-d094-8539511a8e5b";"2019-01-24T00:59:56+01:00";"abtest;PGAJA00;init";""
"1434c999-a818-458d-c2d1-6f3db7e1c19c";"2019-01-24T00:59:57+01:00";"abtest;AZKQA00;controlled";""
"3cacfc25-ca22-4699-9865-d3e32c255d3a";"2019-01-24T00:59:57+01:00";"abtest;AZKQA00;pre-controlled";""
"362c7a33-ba7d-41a0-8800-e7b78f911c4d";"2019-01-24T00:59:57+01:00";"pwa;install;canShow";""
"5c8900bc-d5e1-4b0d-c1cd-9ec3269b5c66";"2019-01-24T00:59:58+01:00";"abtest;PGAJA03;init";""
"c0466a41-b531-4ca8-98ce-ac59b08e3cf7";"2019-01-24T00:59:59+01:00";"abtest;PGAJA00;init";""

And that's my logstash config file:

input {
  file {
    path => '/var/elk-logs/queue/*lfa_event_20*.log'
    start_position => "beginning"
    sincedb_path => "/dev/null"
    add_field => {
      "[@metadata][indexType]" => "lfaEvent"
    }
  }
}

filter {
  if [@metadata][indexType] == "lfaEvent" {
    if [message] in ["id;timestamp;event;test_variant", ""] {
      drop{ }
    } else {
      grok {
        match => { "message" => '"%{UUID:uuid}";"%{TIMESTAMP_ISO8601:[@metadata][timestamp]}";"%{DATA:event}";"%{DATA:test_variant}"' }
      }
      date {
        match => [ "[@metadata][timestamp]" , "ISO8601" ]
      }
    }
  }

  if "_grokparsefailure" in [tags] {
    #drop { }
  } else if [@metadata][indexType] == "lfaEvent" {
    fingerprint {
      source => "message"
      target => "[@metadata][fingerprint]"
      method => "SHA1"
      key => "budeLogfile"
    }
    mutate {
      remove_field => ["host", "path", "message"]
    }
  }
}

output {
  if "_grokparsefailure" in [tags] {
    elasticsearch {
      index => "logstash-grokparsefailure"
    }
  } else if [@metadata][indexType] == "lfaEvent" {
    elasticsearch {
      index => "logstash-lfa-event-%{+YYYY.MM.dd}"
      document_id => "%{[@metadata][fingerprint]}"
    }
  }
}
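
As a side note on the fingerprint filter above: my understanding is that with a key set it produces an HMAC-SHA1 of the message, so re-importing the same line always yields the same document_id and overwrites instead of duplicating. Roughly equivalent to this Python sketch (message and key taken from this thread):

import hashlib
import hmac

# sample line and key from the config above; assumption: fingerprint with a
# key computes a keyed hash (HMAC-SHA1), not a plain SHA1
message = '"a93f1bd7-32fc-43a2-bf7e-929c1e539319";"2019-01-24T00:59:56+01:00";"perf;firstpaint;273";""'
fingerprint = hmac.new(b"budeLogfile", message.encode("utf-8"), hashlib.sha1).hexdigest()
print(fingerprint)  # used as the Elasticsearch document_id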

It works very well but sometimes a grokparsefailure is thrown. Here are some examples:

|Time|Path|Message|
|---|---|---|
|January 29th 2019, 16:20:42.891|/var/elk-logs/queue/bude-web2.priv-lfa_event_20190124_01_00.log|rf;firstpaint;709";""|
|January 29th 2019, 16:09:09.948|/var/elk-logs/queue/bude-web2.priv-lfa_event_20190124_00_02.log|d";""|
|January 29th 2019, 16:08:37.539|/var/elk-logs/queue/bude-web1.priv-lfa_event_20190129_15_02.log|"abtest;PGAJA00;init";""|
|January 29th 2019, 16:08:37.536|/var/elk-logs/queue/bude-web1.priv-lfa_event_20190129_15_01.log|7+01:00";"abtest;AZKQA00;unsupported";""|
|January 29th 2019, 16:08:37.519|/var/elk-logs/queue/bude-web1.priv-lfa_event_20190129_15_03.log|53+01:00";"abtest;PGAJA00;init";""|
|January 29th 2019, 16:08:22.253|/var/elk-logs/queue/bude-web1.priv-lfa_event_20190124_23_02.log|9-01-24T23:40:02+01:00";"abtest;PGAJA00;init";""|
|January 29th 2019, 16:08:21.635|/var/elk-logs/queue/bude-web1.priv-lfa_event_20190124_23_00.log|:19+01:00";"addToCart;click;41981735";""|

I've had this problem for months, and my dirty workaround is to check for grokparsefailures: if I detect a failure, the logfile is deleted and a few minutes later it is copied to the Logstash import folder again. That's the reason why I set sincedb_path to /dev/null.
Normally all rows are imported after the second try; sometimes it takes 4 or 5 tries until the file is completely imported.
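
The check is roughly this idea (a minimal sketch; the Elasticsearch host is an assumption, the index name comes from the output section above):

import json
import urllib.request

# count events in the failure index written by the output section above
with urllib.request.urlopen("http://localhost:9200/logstash-grokparsefailure/_count") as resp:
    count = json.load(resp)["count"]

if count > 0:
    print(f"{count} grokparsefailure events - re-copying the affected logfiles")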

I have absolutely no idea why this happens. It seems like Logstash randomly stops reading a new file after n bytes. There are no rows in the source files that consist only of the message part shown in the grokparsefailure; the message string is always just a fragment of a valid row.

If there are, for example, 10 files which aren't imported completely, I can restart the ELK Docker container and then all files are imported completely after Logstash has started! I tried this a few times and it worked every time.

Are you reusing file names? If so, use randomised names.

Even though the sincedb is being persisted to /dev/null, i.e. not to a file, the positions per file (inode) are still kept in memory.

If a file with the same name (but a different inode) is detected, this may be seen as a file rotation, and the file input may read from 0 or from the position associated with the previous file. If the previous position is used, it probably does not fall on a line boundary, so the "line" that is read is not a full, proper line.

There is also the possibility of inode recycling, where an inode is reused by the OS, so the file input thinks the new file is the same as the other file with a new name - a form of rotation again. If inode recycling is suspected, set sincedb_clean_after => "120s"; this should remove the entry from the in-memory part of the sincedb 120 seconds after the last (internal) activity was recorded for that inode.
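
If you want to see whether that is happening, a minimal Python sketch like this (path glob taken from your config) makes reused inode numbers visible:

import glob
import os

# print the inode of every file in the import folder; if two different files
# show the same inode over time, the OS has recycled it
for path in sorted(glob.glob("/var/elk-logs/queue/*lfa_event_20*.log")):
    print(os.stat(path).st_ino, path)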

Thank you so much for your answer. This was very helpful for me.

My import script deletes the imported files after a successful import and copies the new files to the import path. The problem is that the OS (Ubuntu 18.10 host) sometimes reuses the same inode ID for a new file, exactly like you suspected:

2019-02-04 15:10:24	import file bude-web2.priv-lfa_event_20190202_08.log 1182341
2019-02-04 15:10:46	import file bude-web2.priv-lfa_event_20190202_09.log 1182343
2019-02-04 15:11:09	import file bude-web2.priv-lfa_event_20190202_10.log 1182354
2019-02-04 15:12:02	moved file bude-web2.priv-lfa_event_20190202_08.log to path /var/elk/logs/archive/2019/02/
2019-02-04 15:12:02	moved file bude-web2.priv-lfa_event_20190202_09.log to path /var/elk/logs/archive/2019/02/
2019-02-04 15:12:31	import file bude-web2.priv-lfa_event_20190202_11.log 1182341
2019-02-04 15:12:47	import file bude-web2.priv-lfa_event_20190202_12.log 1182343

I changed my importer so that the file names are unique each time; I just append the current timestamp to the file name during the copy job.
I also activated a sincedb file and set sincedb_clean_after to 120s. Neither helped.

input {
  file {
    path => '/var/elk-logs/queue/*lfa_event_20*.log'
    start_position => "beginning"
    sincedb_path => "/var/elk-logs/sincedb/sincedb"
    sincedb_clean_after => "120s"
    add_field => {
      "[@metadata][indexType]" => "lfaEvent"
    }
  }
}

How does the sincedb work internally? Could I change this file externally?
Every time I delete a file from the import path I could delete the corresponding inode row from the sincedb.
If a sincedb file is set, is anything still kept in memory? Is it safe to update the sincedb?

The sincedb file contents are the persisted state of the in-memory collection. It is persisted every sincedb_write_interval (15 seconds by default). Changes to the file, if they are not overwritten first, will only be read after a restart. So, no, there is no benefit to manipulating the file's contents while LS is running.
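
If you just want to inspect it, a small read-only sketch like this works; note that the column meanings (inode, major/minor device number, byte position, last-activity timestamp, last known path) are my reading of the file input's sincedb format and may differ between plugin versions:

import datetime

SINCEDB = "/var/elk-logs/sincedb/sincedb"  # path from your config

with open(SINCEDB) as f:
    for line in f:
        parts = line.split(maxsplit=5)
        if len(parts) < 6:
            continue
        inode, dev_major, dev_minor, position, last_active, path = parts
        ts = datetime.datetime.fromtimestamp(float(last_active))
        print(f"inode={inode} pos={position} last_active={ts:%Y-%m-%d %H:%M:%S} {path.strip()}")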

I forgot to tell you to set close_older (for tail mode) to a value less than sincedb_clean_after.

Have you considered read mode with the built-in file_completed_action setting (with a value of 'log_and_delete')? Read mode is designed specifically for this use case. The sincedb file is only useful if LS must restart (crash or manual) while reading some files.

To use this effectively:

  • Your import script needs to do a copy and rename: copy to a file name outside the glob pattern, then rename to a file name that will be discovered via the glob. The rename is atomic, meaning the file can only be "discovered" once its contents are completely written and stable (see the sketch after this list).
  • You should set ignore_older (for read mode) to 100s and sincedb_clean_after to 120s. With some observation these can probably be reduced to 30s and 40s (i.e. longer than the LS restart time).
  • You should consider file_chunk_count and file_chunk_size; the defaults mean that one file is read from beginning to end before the next file can be started. By setting them to 1 and 8192 you get 8 KB read from each file in round-robin fashion until all content is read.
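
A minimal Python sketch of the copy-then-rename idea from the first bullet (the paths, the .tmp suffix, and the timestamp naming are illustrative assumptions, not your actual import script):

import shutil
import time
from pathlib import Path

QUEUE = Path("/var/elk-logs/queue")  # directory the file input watches

def import_file(src: Path) -> None:
    # unique final name (original name plus a timestamp) that matches the glob *lfa_event_20*.log
    final = QUEUE / f"{src.stem}_{int(time.time() * 1000)}.log"
    tmp = final.with_suffix(".tmp")  # .tmp is outside the *.log glob, so the file input ignores it
    shutil.copy2(src, tmp)           # the slow copy happens under the invisible name
    tmp.rename(final)                # atomic rename: the file is only discovered once it is complete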

Thanks a lot. I didn't know about read mode until now. That's exactly what I was looking for.

With this setup it works perfectly now:

file {
  path => '/var/elk-logs/queue/*lfa_event_20*.log'
  mode => "read"
  sincedb_path => "/var/elk-logs/sincedb/sincedb"
  file_completed_action => "log_and_delete"
  file_completed_log_path => "/var/elk-logs/logs/complete.log"
  ignore_older => 50
  sincedb_clean_after => "60s"
  add_field => {
    "[@metadata][indexType]" => "lfaEvent"
  }
}

As you suggested, I copy the new file with a temp file name and rename it after the copy process.

In my log I still see the same inode IDs being reused, but they don't cause any problems anymore.

Great!! I'm glad you have a solution that works for your use case.

The ignore_older setting can take a time string as well - it will make the config easier to read in 6 months time :slight_smile:

ignore_older => "50s"

Do the sincedb entries get removed?

I need to know because it is a bug if they are not.

The number of lines in the sincedb should not keep growing, and you should not see entries in the sincedb from, say, 75 seconds (max 60 + 15) after the path entry goes into the complete.log file.

Good point about writing "50s" instead of 50 for readability :wink:

The entries in the sincedb are not removed, but they are replaced correctly. That's my current sincedb:

1182337 0 2053 3033045 1549296642.0830069 /var/elk-logs/queue/bude-web2.priv-lfa_event_20190203_16_1549296602125.log
1182339 0 2053 3473656 1549296749.335725 /var/elk-logs/queue/bude-web2.priv-lfa_event_20190203_17_1549296722379.log
1182358 0 2053 3125107 1549297045.0945852 /var/elk-logs/queue/bude-web2.priv-lfa_event_20190203_21_1549296962918.log
1195866 0 2053 2708722 1549297061.805411 /var/elk-logs/queue/bude-web2.priv-lfa_event_20190203_22_1549296962919.log
1182354 0 2053 1557767 1549297077.740204 /var/elk-logs/queue/bude-web2.priv-lfa_event_20190203_23_1549296962920.log
1182336 0 2053 2948222 1549297579.2358599 /var/elk-logs/queue/bude-web2.priv-lfa_event_20190203_13_1549297442104.log
1182332 0 2053 4146833 1549298297.4975889 /var/elk-logs/queue/bude-web3.priv-lfa_event_20190124_17_1549298162128.log
1182345 0 2053 802647 1549299294.223642 /var/elk-logs/queue/bude-web3.priv-lfa_event_20190131_02_1549299122762.log
1182349 0 2053 3758399 1549299684.594742 /var/elk-logs/queue/bude-web3.priv-lfa_event_20190131_16_1549299603242.log
1182356 0 2053 164882 1549299976.2981322 /var/elk-logs/queue/bude-web3.priv-lfa_event_20190201_03_1549299844716.log
1182357 0 2053 243749 1549300021.485718 /var/elk-logs/queue/bude-web3.priv-lfa_event_20190201_05_1549299844722.log
1182348 0 2053 533301 1549300036.747678 /var/elk-logs/queue/bude-web3.priv-lfa_event_20190201_06_1549299844729.log
1182361 0 2053 992265 1549300052.322504 /var/elk-logs/queue/bude-web3.priv-lfa_event_20190201_07_1549299844745.log
1182363 0 2053 2053450 1549300144.63623 /var/elk-logs/queue/bude-web3.priv-lfa_event_20190201_09_1549300083116.log
1182362 0 2053 2981211 1549300226.3445342 /var/elk-logs/queue/bude-web3.priv-lfa_event_20190201_13_1549300083130.log
1182251 0 2053 2917119 1549300276.737205 /var/elk-logs/queue/bude-web3.priv-lfa_event_20190201_16_1549300083147.log

And that's part of my log:

|2019-02-04 17:48:31|import file bude-web3.priv-lfa_event_20190130_09.log 1182348|
|2019-02-04 17:48:50|import file bude-web3.priv-lfa_event_20190130_10.log 1182345|
|2019-02-04 17:49:12|import file bude-web3.priv-lfa_event_20190130_11.log 1182345|
|2019-02-04 17:49:29|import file bude-web3.priv-lfa_event_20190130_12.log 1182345|
|2019-02-04 17:49:45|import file bude-web3.priv-lfa_event_20190130_13.log 1182345|
|2019-02-04 17:50:13|import file bude-web3.priv-lfa_event_20190130_14.log 1182348|
|2019-02-04 17:50:25|import file bude-web3.priv-lfa_event_20190130_15.log 1182349|
|2019-02-04 17:50:46|import file bude-web3.priv-lfa_event_20190130_16.log 1182348|
|2019-02-04 17:51:08|import file bude-web3.priv-lfa_event_20190130_17.log 1182345|
|2019-02-04 17:51:28|import file bude-web3.priv-lfa_event_20190130_18.log 1182345|
|2019-02-04 17:52:28|import file bude-web3.priv-lfa_event_20190130_19.log 1182348|
|2019-02-04 17:52:46|import file bude-web3.priv-lfa_event_20190130_20.log 1182345|
|2019-02-04 17:53:09|import file bude-web3.priv-lfa_event_20190130_21.log 1182345|
|2019-02-04 17:53:30|import file bude-web3.priv-lfa_event_20190130_22.log 1182345|
|2019-02-04 17:53:46|import file bude-web3.priv-lfa_event_20190130_23.log 1182345|
|2019-02-04 17:54:08|import file bude-web3.priv-lfa_event_20190131_00.log 1182348|
|2019-02-04 17:54:29|import file bude-web3.priv-lfa_event_20190131_01.log 1182345|
|2019-02-04 17:54:49|import file bude-web3.priv-lfa_event_20190131_02.log 1182345|
|2019-02-04 17:55:12|import file bude-web3.priv-lfa_event_20190131_03.log 1182348|
|2019-02-04 17:55:28|import file bude-web3.priv-lfa_event_20190131_04.log 1182251|
|2019-02-04 17:56:29|import file bude-web3.priv-lfa_event_20190131_05.log 1182251|
|2019-02-04 17:56:46|import file bude-web3.priv-lfa_event_20190131_06.log 1182251|
|2019-02-04 17:57:13|import file bude-web3.priv-lfa_event_20190131_07.log 1182251|
|2019-02-04 17:57:25|import file bude-web3.priv-lfa_event_20190131_08.log 1182251|
|2019-02-04 17:57:50|import file bude-web3.priv-lfa_event_20190131_09.log 1182251|
|2019-02-04 17:58:05|import file bude-web3.priv-lfa_event_20190131_10.log 1182348|
|2019-02-04 17:58:33|import file bude-web3.priv-lfa_event_20190131_11.log 1182348|
|2019-02-04 17:58:47|import file bude-web3.priv-lfa_event_20190131_12.log 1182251|
|2019-02-04 17:59:11|import file bude-web3.priv-lfa_event_20190131_13.log 1182251|
|2019-02-04 18:00:28|import file bude-web3.priv-lfa_event_20190131_14.log 1182348|
|2019-02-04 18:00:49|import file bude-web3.priv-lfa_event_20190131_15.log 1182251|
|2019-02-04 18:01:05|import file bude-web3.priv-lfa_event_20190131_16.log 1182349|
|2019-02-04 18:01:33|import file bude-web3.priv-lfa_event_20190131_17.log 1182348|
|2019-02-04 18:01:48|import file bude-web3.priv-lfa_event_20190131_18.log 1182251|
|2019-02-04 18:02:07|import file bude-web3.priv-lfa_event_20190131_19.log 1182251|
|2019-02-04 18:02:24|import file bude-web3.priv-lfa_event_20190131_20.log 1182251|
|2019-02-04 18:02:48|import file bude-web3.priv-lfa_event_20190131_21.log 1182251|
|2019-02-04 18:03:11|import file bude-web3.priv-lfa_event_20190131_22.log 1182251|
|2019-02-04 18:03:33|import file bude-web3.priv-lfa_event_20190131_23.log 1182251|
|2019-02-04 18:04:26|import file bude-web1.priv-lfa_event_20190204_17.log 1182348|
|2019-02-04 18:04:55|import file bude-web2.priv-lfa_event_20190204_17.log 1182251|
|2019-02-04 18:05:05|import file bude-web3.priv-lfa_event_20190201_00.log 1182251|
|2019-02-04 18:05:32|import file bude-web3.priv-lfa_event_20190201_01.log 1182348|
|2019-02-04 18:05:53|import file bude-web3.priv-lfa_event_20190201_02.log 1182251|
|2019-02-04 18:06:05|import file bude-web3.priv-lfa_event_20190201_03.log 1182356|
|2019-02-04 18:06:27|import file bude-web3.priv-lfa_event_20190201_04.log 1182348|
|2019-02-04 18:06:47|import file bude-web3.priv-lfa_event_20190201_05.log 1182357|
|2019-02-04 18:07:08|import file bude-web3.priv-lfa_event_20190201_06.log 1182348|
|2019-02-04 18:07:26|import file bude-web3.priv-lfa_event_20190201_07.log 1182361|
|2019-02-04 18:08:32|import file bude-web3.priv-lfa_event_20190201_08.log 1182362|
|2019-02-04 18:08:47|import file bude-web3.priv-lfa_event_20190201_09.log 1182363|
|2019-02-04 18:09:08|import file bude-web3.priv-lfa_event_20190201_10.log 1182362|
|2019-02-04 18:09:32|import file bude-web3.priv-lfa_event_20190201_11.log 1182251|
|2019-02-04 18:09:50|import file bude-web3.priv-lfa_event_20190201_12.log 1182251|
|2019-02-04 18:10:11|import file bude-web3.priv-lfa_event_20190201_13.log 1182362|
|2019-02-04 18:10:32|import file bude-web3.priv-lfa_event_20190201_14.log 1182251|
|2019-02-04 18:10:52|import file bude-web3.priv-lfa_event_20190201_15.log 1182251|
|2019-02-04 18:11:05|import file bude-web3.priv-lfa_event_20190201_16.log 1182251|

There are still some old entries but the inodes are all correct.

Good to know, thanks.


Just as short feedback: I haven't received a single grokparsefailure since I changed it yesterday, and I imported around 250 files last night.

This morning I removed ignore_older and sincedb_clean_after. It still works without these two parameters, without any trouble. Read mode and letting Logstash delete the files did the trick.

Thanks again for the fast help!

