Occasionally we are missing log data, and it appears to happen right after a log rotation. If we have two files, "old_log" and "new_log", all log lines from "old_log" will be present, but the first couple (could be a lot) of lines from "new_log" will be missing. Also, the first line that is picked up from "new_log" when this happens is not complete and gets tagged with "_jsonparsefailure". After the jsonparsefailure everything seems to return to normal, but that line and the preceding lines never end up in Elasticsearch (presumably they never even get shipped to Logstash).
Quick edit: these logs come from Java applications that use Logback with a time-based rolling policy.
Just wanted to add that I tested our log rotation method to make sure it was not truncating logs, and it appears to be doing what it should: the old file is renamed and keeps its inode, and a new file with a new inode is created for the new log.
$ ls -i test*
28583565 test.log
$ ls -i test*
28583578 test.log 28583565 test.log.201803220833
filter {
  if "metric" not in [tags] {
    # Capture the @timestamp for use in lag calculations.
    # Capture the current time in nanos for use in filter timing.
    # Assign future and past limits (in seconds) for events.
    # Events that fall outside these limits are discarded.
    ruby {
      code => "
        event.set('[@metadata][elk_rcvd]', event.get('@timestamp'));
        event.set('[@metadata][elk_lag_future]', 86400.0);
        event.set('[@metadata][elk_lag_past]', 604800.0);
      "
    }
  }
}
21-filter-service-json
filter {
  if [type] == "service-json" {
    # Pull the embedded json fields up to the top level.
    json { source => "message" }
    # Convert and trim
    mutate {
      # Mark for output to elasticsearch
      add_field => { "[@metadata][elastic_output]" => true }
      add_field => { "[@metadata][elastic_index]" => "service" }
      # Rename some common fields to their indexed names.
      rename => {
        "request:header:x-site"    => "site_id"
        "request:header:x-trace"   => "trace_id"
        "request:header:x-reqname" => "requestor_id"
        "request:header:x-reqhost" => "requestor_host"
      }
      # Coerce some popular fields that are known to have inconsistencies
      convert => {
        "site_id"  => "string"
        "shard_id" => "string"
      }
    }
  }
}
29-filter-stop
filter {
  # Only collect metrics for items that are heading to elasticsearch.
  # This will filter out anything else (e.g. metrics) that is in the pipeline.
  if [@metadata][elastic_output] {
    # Calculate ingestion lag
    ruby {
      code => "
        event.set('elk_lag', event.get('[@metadata][elk_rcvd]') - event.get('@timestamp'));
        event.set('[@metadata][elk_lag_future]', -1 * event.get('[@metadata][elk_lag_future]'));
      "
    }
    # Drop events > elk_lag_future seconds in the future
    # Drop events > elk_lag_past seconds in the past
    if [elk_lag] {
      if [elk_lag] < [@metadata][elk_lag_future] {
        drop {}
      }
      if [elk_lag] > [@metadata][elk_lag_past] {
        drop {}
      }
    }
    # Rates by input type, output index
    metrics {
      meter => "type.%{type}"
      meter => "index.%{[@metadata][elastic_index]}"
      clear_interval => 60
      flush_interval => 60
      rates => [1,5]
      # Mark emitted event for output to graphite
      add_field => { "[@metadata][graphite_output]" => true }
      add_field => { "[@metadata][graphite_index]" => "logstash" }
    }
  }
}
30-output-elastic
output {
  # Only include events that have been marked for elasticsearch output.
  # Filters that interpret incoming logs should set this flag explicitly.
  # This will help keep stray events (e.g. metrics) from being indexed in
  # ElasticSearch.
  if [@metadata][elastic_output] {
    if [@metadata][audit_data] {
      elasticsearch {
        hosts => ['es-001','es-002','es-003']
        flush_size => 4096
        index => "audit-%{+YYYY.MM}"
        manage_template => false
      }
      elasticsearch {
        hosts => ['es-001','es-002','es-003']
        flush_size => 4096
        index => "infrastructure-%{+YYYY.MM.dd}"
        manage_template => false
      }
    } else {
      elasticsearch {
        hosts => ['es-001','es-002','es-003']
        flush_size => 4096
        index => "%{[@metadata][elastic_index]}-%{+YYYY.MM.dd}"
        manage_template => false
      }
    }
  }
}
Are these symptoms the same as those that would be caused by inode reuse? The thing is, they happen a couple of times per day with log rotation set to every hour, which seems too frequent to be inode reuse.
On Linux file systems, Filebeat uses the inode and device to identify files. When a file is removed from disk, the inode may be assigned to a new file. In use cases involving file rotation, if an old file is removed and a new one is created immediately afterwards, the new file may have the exact same inode as the file that was removed. In this case, Filebeat assumes that the new file is the same as the old and tries to continue reading at the old position, which is not correct.
By default states are never removed from the registry file. To resolve the inode reuse issue, we recommend that you use the clean_* options, especially clean_inactive, to remove the state of inactive files. For example, if your files get rotated every 24 hours, and the rotated files are not updated anymore, you can set ignore_older to 48 hours and clean_inactive to 72 hours.
You can use clean_removed for files that are removed from disk. Be aware that clean_removed cleans the file state from the registry whenever a file cannot be found during a scan. If the file shows up again later, it will be sent again from scratch.
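A minimal sketch of how that recommendation might look in filebeat.yml, assuming Filebeat 6.x (where the section is called filebeat.prospectors; newer releases rename it to filebeat.inputs) and a made-up log path:

filebeat.prospectors:
  - type: log
    # Placeholder path; point this at the rotated application logs.
    paths:
      - /var/log/service/*.log
    # Stop harvesting files that have not been updated for 48 hours.
    ignore_older: 48h
    # Remove the registry state of files that have been inactive for 72 hours,
    # so a reused inode is not mistaken for the old file and resumed mid-stream.
    clean_inactive: 72h
    # Also drop the state of files that disappear from disk between scans;
    # if such a file shows up again it is re-sent from the beginning.
    clean_removed: true

Note that clean_inactive needs to stay greater than ignore_older plus scan_frequency so that state is not removed while a file is still being harvested.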
If it is inode reuse, I would expect to see Filebeat "resuming" on your new files from a much higher offset, such that you would lose a lot more than the "first couple (could be a lot)" of lines from new_log. I think that if you enable debug logging for the prospector and harvester selectors you should be able to prove or disprove the theory based on this log line.
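In case it is useful, a sketch of what that debug logging could look like in filebeat.yml (the selector names below are the ones Filebeat 6.x uses for the file-tracking components):

# Debug output restricted to the components that track files and offsets.
logging.level: debug
logging.selectors: ["prospector", "harvester"]

The same can be done temporarily by starting Filebeat with the -e and -d "prospector,harvester" flags instead of editing the config.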
Using the file output method you suggested, it does look like it's a Filebeat issue. This is the first message that Filebeat picked up after a log rotation happened. The log line is split, as you can see from the message. The actual log line is valid JSON.
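For anyone reproducing this, a file output of the sort referred to above might look roughly like this in filebeat.yml (the path and filename are placeholders, and in recent Filebeat versions only one output can be enabled at a time):

# Dump events to local files instead of Logstash so the raw Filebeat output
# can be inspected directly.
output.file:
  path: "/tmp/filebeat-debug"
  filename: "filebeat-events"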