File input and file output line count mismatch

(Mihir Ray) #1

Input data:
The input is a static file containing 10 lines of access logs.
If there is a way to share a file, please let me know and I can share the input file.


input {
  file {
    sincedb_path => "/opt/analytics/logstash/sincedb/da_neat.sincedb"
    path => ["/opt/analytics/logstash/conf/neat_test"]
    start_position => "beginning"
    exclude => "*.gz"
    type => "da_neat"
  }
}

output {
  file {
    codec => "plain"
    path => "/opt/analytics/logstash/data/da_neat/da_neat02-%{+YYYY-MM-dd-HH}.json"
  }
}

#1: With the above input and config, I get 9 lines in the output instead of 10. This happens with any file; the last line is always skipped, and I'm not sure why.
#2: After adding the grok, urldecode, and kv filters, I got 10 records, which matches the input line count. So far so good.
grok {
  match => { "message" => "%{DATA:remote_addr} %{DATA:attr1} %{DATA:remote_user} \[%{HTTPDATE:server_timestamp}\] \"%{DATA:attr2}\" %{NUMBER:status} (?:%{NUMBER:body_bytes_sent}|-) \"%{DATA:http_referer}\" \"%{DATA:http_user_agent}\" \"%{DATA:http_x_forwarded_for}\" \"%{DATA:request_body}\"" }
}

kv {
  source => "request_body"
  field_split => "&"
}

urldecode {
  all_fields => true
}

#3: Then I added the date filter to parse a timestamp field from the log. This results in the first and seventh records missing from the output (8 records in total).

date {
  match => [ "server_timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
}

#4: I added mutate to remove unwanted fields:
mutate {
  remove_field => [ "message", "@version", "type", "host", "path", "request_body", "attr1", "attr2", "status", "body_bytes_sent", "http_x_forwarded_for" ]
}
This further decreased the output line count to 6.

Can someone please help me understand this behavior of Logstash?

Mihir Ray

(Magnus Bäck) #2

Are you deleting the sincedb file between each test run, or how are you getting Logstash to reprocess the file?

(Mihir Ray) #3

Yes, I am deleting the sincedb file between each run.

(Jordan Sissel) #4

The file output only flushes periodically, and it only checks whether to flush after each write. So if it flushes and then 2 new events arrive before the flush interval has expired, those 2 events stay buffered and won't be flushed until a later write triggers another flush check.
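
If the file output has to stay in place while testing, one thing to try is shortening its flush interval. This is a minimal sketch, assuming the file output plugin's `flush_interval` option (in seconds, where 0 should flush on every message) is available in your version; the path is the one from the original post.

```
output {
  file {
    path => "/opt/analytics/logstash/data/da_neat/da_neat02-%{+YYYY-MM-dd-HH}.json"
    # Assumption: 0 makes the plugin flush after every event instead of buffering
    flush_interval => 0
  }
}
```

With a static 10-line input this should make the tail of the file show up without waiting for a later write, at the cost of more frequent disk writes.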

Try observing using the stdout output instead, which doesn't buffer.
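
A minimal sketch of that, assuming the input and filter sections stay exactly as they are. The `rubydebug` codec is only an illustrative choice to make the events easier to read, not something the test requires.

```
output {
  # Print every event to the console as soon as it has been processed
  stdout {
    codec => rubydebug
  }
}
```

If all 10 events show up here, the lines missing from the file are most likely still sitting in the file output's buffer rather than being dropped by the filters.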
