Dropping log lines on log rotation

Filebeat version: 5.6.3
OS: CentOS

Occasionally, log data goes missing right after a log rotation. If we have two files, "old_log" and "new_log", all log lines from "old_log" are present, but the first few (sometimes many) lines of "new_log" are missing. In addition, the first line that is picked up from "new_log" when this happens is incomplete and gets tagged with a "_jsonparsefailure". After the "_jsonparsefailure" everything seems to return to normal, but that line and the preceding lines never end up in Elasticsearch (presumably they never even get shipped to Logstash).

Edit: these logs come from Java applications that use Logback with a time-based rolling policy.

Relevant configs:

filebeat:
  config_dir: /etc/filebeat/conf.d
  idle_timeout: 5s
  publish_async: false
  registry_file: /var/lib/filebeat/registry
  spool_size: 600
logging:
  files:
    keepfiles: 5
    name: filebeat.log
    path: /var/log/filebeat
    rotateeverybytes: 10485760
  level: info
  to_files: true
  to_syslog: false
output:
  logstash:
    bulk_max_size: 100
    compression_level: 3
    hosts: ['fake-008','fake-009','fake-010','fake-011','fake-012','fake-007']
    loadbalance: true
    max_retries: -1
    worker: 1
filebeat:
  prospectors:
  -
    document_type: service-json
    paths:
      - /usr/local/services/*/log/*-logstash.log
      - /var/log/services/*-logstash.log

It would be good to isolate the problem to Filebeat. Can you reproduce the issue if you are only using the file output?
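
For reference, a minimal sketch of what that could look like in filebeat.yml, in the same style as your existing output block (the path, filename, and rotation values here are just examples to adjust):

output:
  file:
    path: "/tmp/filebeat-debug"
    filename: filebeat-events
    rotate_every_kb: 10240
    number_of_files: 7

Then you can compare the events Filebeat wrote to that file against the original log to see exactly where lines go missing.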

Can you please share your Logstash config too?

I will try to capture this with the file output, but I think it will be hard to catch because it does not happen frequently.

Logstash version: 5.6.3

logstash.yml

path.data: /var/lib/logstash
path.config: /etc/logstash/conf.d
path.logs: /var/log/logstash
pipeline.workers: 16
xpack.monitoring.enabled: true
xpack.monitoring.elasticsearch.url: ["http://es-001:9200", "http://es-002:9200", "http://es-003:9200"]

Just wanted to add that I tested our log rotation method to make sure it was not truncating logs. It appears to be doing what it should: the old file is renamed and keeps its inode, and a new file with a new inode is created for the new log.

$ ls -i test*
28583565 test.log
$ ls -i test*
28583578 test.log              28583565 test.log.201803220833
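
If I'm reading the registry format right (it looks like plain JSON on disk), the per-file state Filebeat has stored can also be dumped and compared against the ls -i output above, e.g.:

$ python -m json.tool < /var/lib/filebeat/registry

Each entry should show the source path, the offset, and the inode/device Filebeat has associated with that file.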

Please include the configs from /etc/logstash/conf.d so we can understand what Logstash is doing with the data.

I think these are the relevant ones:

10-input-filebeat

input {
  beats {
    port => 10200
    tags => []
  }
}

20-filter-start

filter {
  if "metric" not in [tags] {
    # Capture the @timestamp for use in lag calculations.
    # Capture the current time in nanos for use in filter timing.
    # Assign future and past limits (in seconds) for events.
    # Events that fall outside these limits are discarded.
    ruby {
      code => "
        event.set('[@metadata][elk_rcvd]', event.get('@timestamp'));
        event.set('[@metadata][elk_lag_future]', 86400.0);
        event.set('[@metadata][elk_lag_past]', 604800.0);
      "
    }
  }
}

21-filter-service-json

filter {
  if [type] == "service-json" {

    # Pull the embedded json fields up to the top level.
    json { source => "message" }

    # Convert and trim
    mutate {
      # Mark for output to elasticsearch
      add_field => { "[@metadata][elastic_output]" => true }
      add_field => { "[@metadata][elastic_index]" => "service" }

      # Rename some common fields to their indexed names.
      rename => {
        "request:header:x-site"    => "site_id"
        "request:header:x-trace"   => "trace_id"
        "request:header:x-reqname" => "requestor_id"
        "request:header:x-reqhost" => "requestor_host"
      }

      # Coerce some popular fields that are known to have inconsistencies
      convert => {
        "site_id" => "string"
        "shard_id" => "string"
      }
    }
  }
}

29-filter-stop

filter {
  # Only collect metrics for items that are heading to elasticsearch
  # This will filter out anything else (e.g. metrics) that are in the pipeline.
  if [@metadata][elastic_output] {
    # Calculate ingestion lag
    ruby {
      code => "
        event.set('elk_lag', event.get('[@metadata][elk_rcvd]') - event.get('@timestamp'));
        event.set('[@metadata][elk_lag_future]', -1 *  event.get('[@metadata][elk_lag_future]'));
      "
    }

    # Drop events > elk_lag_future seconds in the future
    # Drop events > elk_lag_past seconds in the past
    if [elk_lag] {
      if [elk_lag] < [@metadata][elk_lag_future] {
        drop {}
      }
      if [elk_lag] > [@metadata][elk_lag_past] {
        drop {}
      }
    }

    # Rates by input type, output index
    metrics {
      meter           => "type.%{type}"
      meter           => "index.%{[@metadata][elastic_index]}"
      clear_interval  => 60
      flush_interval  => 60
      rates           => [1,5]

      # Mark emit event for output to graphite
      add_field => { "[@metadata][graphite_output]" => true }
      add_field => { "[@metadata][graphite_index]" => "logstash" }
    }
  }
}

30-output-elastic

output {
  # Only include events with the "elasticsearch" tag.
  # Filters that interpret incoming logs should add this tag explicitly.
  # This will help keep stray events (e.g. metrics) from being indexed in
  # ElasticSearch.
  if [@metadata][elastic_output] {

    if [@metadata][audit_data] {
      elasticsearch {
        hosts       => ['es-001','es-002','es-003']
        flush_size  => 4096
        index       => "audit-%{+YYYY.MM}"
        manage_template => false
      }

      elasticsearch {
        hosts       => ['es-001','es-002','es-003']
        flush_size  => 4096
        index       => "infrastructure-%{+YYYY.MM.dd}"
        manage_template => false
      }
    } else {

      elasticsearch {
        hosts       => ['es-001','es-002','es-003']
        flush_size  => 4096
        index       => "%{[@metadata][elastic_index]}-%{+YYYY.MM.dd}"
        manage_template => false
      }
    }
  }
}

Are these the same symptoms that inode reuse would cause? The thing is, they happen a couple of times per day with log rotation set to every hour, which seems too frequent to be inode reuse.

On Linux file systems, Filebeat uses the inode and device to identify files. When a file is removed from disk, the inode may be assigned to a new file. In use cases involving file rotation, if an old file is removed and a new one is created immediately afterwards, the new file may have the exact same inode as the file that was removed. In this case, Filebeat assumes that the new file is the same as the old and tries to continue reading at the old position, which is not correct.

By default states are never removed from the registry file. To resolve the inode reuse issue, we recommend that you use the clean_* options, especially clean_inactive, to remove the state of inactive files. For example, if your files get rotated every 24 hours, and the rotated files are not updated anymore, you can set ignore_older to 48 hours and clean_inactive to 72 hours.

You can use clean_removed for files that are removed from disk. Be aware that clean_removed cleans the file state from the registry whenever a file cannot be found during a scan. If the file shows up again later, it will be sent again from scratch.
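
Applied to your prospector, that could look roughly like the sketch below. The values just mirror the 24-hour example above, so scale them to match your hourly rotation, and keep in mind that clean_inactive has to be greater than ignore_older plus the scan frequency:

filebeat:
  prospectors:
  -
    document_type: service-json
    paths:
      - /usr/local/services/*/log/*-logstash.log
      - /var/log/services/*-logstash.log
    # Example values for files rotated every 24 hours; tune to your interval.
    ignore_older: 48h
    clean_inactive: 72h
    clean_removed: true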

If it is inode reuse, I would expect to see Filebeat "resuming" on your new files from a much higher offset, such that you would lose a lot more than the first couple of lines of new_log. I think that if you enable debug logging for the prospector and harvester selectors, you should be able to prove or disprove the theory based on the resulting harvester log lines.

logging.level: debug
logging.selectors: [prospector, harvester]
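
If it is easier than editing the config, the same selectors can be enabled on the command line when running Filebeat in the foreground (assuming the usual 5.x flags: -e logs to stderr, -c points at the config, and -d enables the listed debug selectors):

/usr/share/filebeat/bin/filebeat -e -c /etc/filebeat/filebeat.yml -d "prospector,harvester"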

Using the file output method you suggested, it does look like it's a Filebeat issue. This is the first message that Filebeat picked up after a log rotation happened. As you can see from the message field, the log line is split partway through; the actual log line in the file is valid JSON.

{
  "@timestamp": "2018-03-22T15:30:04.900Z",
  "beat": {
    "hostname": "frodo-stg-002",
    "name": "frodo-stg-002",
    "version": "5.1.1"
  },
  "input_type": "log",
  "message": "ityDeleteAction.execute(EntityDeleteAction.java:114)\\n\\tat org.hibernate.engine.spi.ActionQueue.executeAct    ions(ActionQueue.java:463)\\n\\tat org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:349)\\n\\tat org.hibernate.event.internal.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:350)\\n\\tat org.hibernate.event.intern    al.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:56)\\n\\tat org.hibernate.internal.SessionImpl.flush(SessionImpl.java:1222)\\n\\tat org.hibernate.internal.SessionImpl.managedFlush(SessionImpl.java:425)\\n\\tat org.hibernate.engine.transaction.inter    nal.jdbc.JdbcTransaction.beforeTransactionCommit(JdbcTransaction.java:101)\\n\\tat org.hibernate.engine.transaction.spi.AbstractTransactionImpl.commit(AbstractTransactionImpl.java:177)\\n\\tat org.springframework.orm.hibernate4.HibernateTransactionManager.doCommit(Hib    ernateTransactionManager.java:584)\\n\\t... 35 common frames omitted\\n\",\"service\":\"frodo\"}",
  "offset": 13416,
  "source": "/usr/local/frodo/log/service-logstash.log",
  "type": "service-json"
}

You said Filebeat 5.6.3, but the data says 5.1.1. It would be good to try running the latest version.

This also happens on hosts running Filebeat 5.6.6, but I will try to capture a message from one of those hosts to confirm it.

/usr/share/filebeat/bin/filebeat --version
filebeat version 5.6.6 (amd64), libbeat 5.6.6

With debug enabled, when rotation occurs are you seeing this in the logs:

Start harvester for new file: /usr/local/frodo/log/service-logstash.log

Or are you seeing something like this:

"Update existing file for harvesting: /usr/local/frodo/log/service-logstash.log, offset: <offset>?

Yes, and it seemed off to me that the first occurrence of it showed up even though Filebeat was restarted for this test shortly after the log was rotated.

Update existing file for harvesting: /usr/local/frodo/log/service-logstash.log, offset: 23099

Seems like an inode reuse issue. Try deploying the options from https://www.elastic.co/guide/en/beats/filebeat/5.6/faq.html#inode-reuse-issue.

Thanks for your help! I will try that and let you know if it solves the issue.
