Multiple entries for single file in sincedb

Hi,

I read a single file from server(C:/Users/xx.log - its a rotating log file daily) and route to kafka. In a single sincedb file, we see multiple entries like below. Is this valid. we see dupes on Kafka topic. Is it possible that it reads offset from 1st line read old data and then read 2nd line offset then it starts reading latest data. on 10/17, we still see data from 10/12 pushed to kafka. will it be due to below multiple entry.

a. If so can I just delete the first entry and will that be taken care or do I need to delete sincedb file and restart logstash. In my logstash config, I didnt mention start_position or sincedb path. Do I need to mention these two parms in my config.

b. Do we know why multiple entries are created in single file. If we know, how can we avoid next time.

Any help appreciated. Thanks.

123456-789-012345 0 0 192463431 1602474319.772 C:/Users/xx.log
123456-888-234567 0 0 12471665 1602834423.076 C:/Users/xx.log

Config:
----------

input {
  file {
    path => "C:/Users/xx.log"
  }
}

filter {
  mutate {
    remove_field => ["@timestamp", "@version", "path", "host"]
  }
}

output {
  kafka {
    bootstrap_servers => "broker"
    codec => plain { format => "%{message}" }
    topic_id => "topic" 
   }
}

If the file is rotated the newly created file will have a new inode (or its Windows equivalent) so a new sincedb entry will be created. The name is not the primary identifier of the file, the inode is.

1 Like

@Badger ok. so you say it will refer to the latest inode(which is second entry) and get the latest offset and read the new records. The rotated logs files are renamed as xx.log.2020-10-12 etc... for the first entry, even after the rename happened to xx.log.2020-10-12, will it read again the same file and send data to kafka. I ask this since I see dupes in kafka, so.

Determining whether a rotated log file is a previously read file or not is an extremely hard problem. Doing checksums of the files would solve the problem, but it is rather expensive. The file input implements a far cheaper solution that usually works, but sometimes it will ignore a file it should read and sometime it will re-read a rotated log file.

There are several open issues that discuss use cases it gets wrong and possible solutions. Some of those issues have been open for several years and I do not forsee them getting addressed any time soon.

in prod, I use config file and use nssm to run logstash. I dont use logstash.yml. will this be cause for any duplicates or for reading old files repeatedly.

I do not think so.

in my test environment, whenever I rotate log file, new entry created in sincedb.

In prod, I dont see that way. the log files get rotated daily in past 7 days (we started logstash on 10/11). but we see only two entries like below. in kafka we are seeing still data from 10/12 or 10/13 and we dont recieve latest ones. so not sure how to debug this. we have more than 10 servers where logstash installed and routed to kafka. and in most of the servers we still see 10/13 data being sent and not latest ones.

123456-789-012345 0 0 192463431 1602474319.772 C:/Users/xx.log
123456-888-234567 0 0 12471665 1602834423.076 C:/Users/xx.log

Is it possible that prod has sincedb_clean_after set to a lower value?

To avoid repeatedly reading very old files the ignore_older option might help.

You may be able to filter out old events using something like this. Although I doubt you can prevent the previous day's file being immediately re-read just after rotation.

Thanks. I didnt set sincedb_clean_after in my config. let me check on that.

I wanted to check on the volume where we have 900,000 (900K logs) for 24 hr period spread out in 12 servers where logstash installed in all 12. logstash should be able to handle this volume right.

If there are 900,000 files per day across 12 servers that is nearly a file per second. Whether those servers will be able to do that depends on the size of the files, the complexity of the processing, the power of the servers, and the power (mainly CPU and IO) of the systems that the results are sent to.

Nobody can answer that question for you. Test and tune...

ok sure - what I meant is 900,000 rows of data/records spread out across all servers and I don't do any processing. just take those records and push it to kafka. Number from all those 10 servers per day might be aound 800k - 900k records.

800,000 records per day would be less than 10 records per second. I would expect even a single small computer to be able to do that.

from logstash which is running on 10 servers, I send data to one topic which has 5 partitions. at this point I see only data recieved from 5 servers. even though other 5 servers have data, I dont see it is coming. do I need to have 10 partitions for that one kafka topic.

To test, I run from only one server now. After log rotation, I still see sincedb refers to old inode and I dont see data coming from newly rotated log file.

@Badger - As I explained my issue earlier, I have same inode even after log rotation happens. Issue I see is it reads message from older file and sends data.

so

  1. I stopped logstash
  2. delete sincedb file
  3. added start_position => beginning to config file and started.

Now it started reading from the top. I need to check tonight what happens when log rotation happens.

Also in our connection, we use kafka with kerberos auth. so for every record we get from logs, logstash need to make SASL_SSL connection to kafka. Do you see this as an overhead.

If the same situation arise tonight(reading older files again), do you suggest to get the timestamp(inside the message we have in yyyyxxmm format). Use that and filter the events which are older than 1 day. Data is in csv format. do i need to regex and get the timestamp and process it.