Occasionally we are missing log data, and it appears to happen right after a log rotation. If we have two files, "old_log" and "new_log", all log lines from "old_log" will be present, but the first couple (could be a lot) of lines from "new_log" will be missing. Also, the first line that is picked up from "new_log" when this happens is not complete and gets tagged with "_jsonparsefailure". After the jsonparsefailure everything seems to return to normal, but that line and the preceding lines never end up in Elasticsearch (presumably they never even get shipped to Logstash).
Quick edit: these logs come from Java applications that use Logback with a time-based rolling policy.
Just wanted to add that I tested our log rotation method to make sure it was not truncating logs, and it appears to be doing what it should: the old file is renamed and keeps its inode, and a new file with a new inode is created for the new log.
$ ls -i test*
28583565 test.log
$ ls -i test*
28583578 test.log 28583565 test.log.201803220833
filter {
  if "metric" not in [tags] {
    # Capture the @timestamp for use in lag calculations.
    # Capture the current time in nanos for use in filter timing.
    # Assign future and past limits (in seconds) for events.
    # Events that fall outside these limits are discarded.
    ruby {
      code => "
        event.set('[@metadata][elk_rcvd]', event.get('@timestamp'));
        event.set('[@metadata][elk_lag_future]', 86400.0);
        event.set('[@metadata][elk_lag_past]', 604800.0);
      "
    }
  }
}
21-filter-service-json
filter {
  if [type] == "service-json" {
    # Pull the embedded json fields up to the top level.
    json { source => "message" }
    # Convert and trim
    mutate {
      # Mark for output to elasticsearch
      add_field => { "[@metadata][elastic_output]" => true }
      add_field => { "[@metadata][elastic_index]" => "service" }
      # Rename some common fields to their indexed names.
      rename => {
        "request:header:x-site"    => "site_id"
        "request:header:x-trace"   => "trace_id"
        "request:header:x-reqname" => "requestor_id"
        "request:header:x-reqhost" => "requestor_host"
      }
      # Coerce some popular fields that are known to have inconsistencies
      convert => {
        "site_id"  => "string"
        "shard_id" => "string"
      }
    }
  }
}
29-filter-stop
filter {
  # Only collect metrics for items that are heading to elasticsearch.
  # This will filter out anything else (e.g. metrics) that is in the pipeline.
  if [@metadata][elastic_output] {
    # Calculate ingestion lag
    ruby {
      code => "
        event.set('elk_lag', event.get('[@metadata][elk_rcvd]') - event.get('@timestamp'));
        event.set('[@metadata][elk_lag_future]', -1 * event.get('[@metadata][elk_lag_future]'));
      "
    }
    # Drop events > elk_lag_future seconds in the future
    # Drop events > elk_lag_past seconds in the past
    if [elk_lag] {
      if [elk_lag] < [@metadata][elk_lag_future] {
        drop {}
      }
      if [elk_lag] > [@metadata][elk_lag_past] {
        drop {}
      }
    }
    # Rates by input type, output index
    metrics {
      meter => "type.%{type}"
      meter => "index.%{[@metadata][elastic_index]}"
      clear_interval => 60
      flush_interval => 60
      rates => [1,5]
      # Mark emitted event for output to graphite
      add_field => { "[@metadata][graphite_output]" => true }
      add_field => { "[@metadata][graphite_index]" => "logstash" }
    }
  }
}
30-output-elastic
output {
  # Only include events that have been marked for elasticsearch output.
  # Filters that interpret incoming logs should set this flag explicitly.
  # This will help keep stray events (e.g. metrics) from being indexed in
  # ElasticSearch.
  if [@metadata][elastic_output] {
    if [@metadata][audit_data] {
      elasticsearch {
        hosts => ['es-001','es-002','es-003']
        flush_size => 4096
        index => "audit-%{+YYYY.MM}"
        manage_template => false
      }
      elasticsearch {
        hosts => ['es-001','es-002','es-003']
        flush_size => 4096
        index => "infrastructure-%{+YYYY.MM.dd}"
        manage_template => false
      }
    } else {
      elasticsearch {
        hosts => ['es-001','es-002','es-003']
        flush_size => 4096
        index => "%{[@metadata][elastic_index]}-%{+YYYY.MM.dd}"
        manage_template => false
      }
    }
  }
}
Are these symptoms the same as those that would be caused by inode reuse? The thing is, they happen a couple of times per day with log rotation set to every hour, which seems too frequent to be inode reuse.
On Linux file systems, Filebeat uses the inode and device to identify files. When a file is removed from disk, the inode may be assigned to a new file. In use cases involving file rotation, if an old file is removed and a new one is created immediately afterwards, the new file may have the exact same inode as the file that was removed. In this case, Filebeat assumes that the new file is the same as the old and tries to continue reading at the old position, which is not correct.
By default states are never removed from the registry file. To resolve the inode reuse issue, we recommend that you use the clean_* options, especially clean_inactive, to remove the state of inactive files. For example, if your files get rotated every 24 hours, and the rotated files are not updated anymore, you can set ignore_older to 48 hours and clean_inactive to 72 hours.
You can use clean_removed for files that are removed from disk. Be aware that clean_removed cleans the file state from the registry whenever a file cannot be found during a scan. If the file shows up again later, it will be sent again from scratch.
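A minimal sketch of how that recommendation might look in filebeat.yml, assuming Filebeat 6.x (where the section is called filebeat.prospectors; newer releases rename it to filebeat.inputs) and a made-up log path:

filebeat.prospectors:
  - type: log
    # Placeholder path; point this at the rotated application logs.
    paths:
      - /var/log/service/*.log
    # Stop harvesting files that have not been updated for 48 hours.
    ignore_older: 48h
    # Remove the registry state of files that have been inactive for 72 hours,
    # so a reused inode is not mistaken for the old file and resumed mid-stream.
    clean_inactive: 72h
    # Also drop the state of files that disappear from disk between scans;
    # if such a file shows up again it is re-sent from the beginning.
    clean_removed: true

Note that clean_inactive needs to stay greater than ignore_older plus scan_frequency so that state is not removed while a file is still being harvested.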
If it is inode reuse, I would expect to see Filebeat "resuming" on your new files from a much higher offset, such that you would lose a lot more than the "first couple (could be a lot)" of lines from new_log. I think that if you enable debug logging for the prospector and harvester selectors you should be able to prove or disprove the theory based on this log line.
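In case it is useful, a sketch of what that debug logging could look like in filebeat.yml (the selector names below are the ones Filebeat 6.x uses for the file-tracking components):

# Debug output restricted to the components that track files and offsets.
logging.level: debug
logging.selectors: ["prospector", "harvester"]

The same can be done temporarily by starting Filebeat with the -e and -d "prospector,harvester" flags instead of editing the config.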
Using the file output method you suggested, it does look like it's a Filebeat issue. This is the first message that Filebeat picked up after a log rotation happened. The log line is split, as you can see from the message. The actual log line is valid JSON.
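For anyone reproducing this, a file output of the sort referred to above might look roughly like this in filebeat.yml (the path and filename are placeholders, and in recent Filebeat versions only one output can be enabled at a time):

# Dump events to local files instead of Logstash so the raw Filebeat output
# can be inspected directly.
output.file:
  path: "/tmp/filebeat-debug"
  filename: "filebeat-events"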