Random number of messages dropping on Logstash import

I have Logstash 7.6 running on Ubuntu 20.04. I am importing a pretty simple CSV file. The data in the file is separated by commas. See below:

    email,send_time,engaged,last_touch_utc,score,frequency,open_count,click_count,codes
    VlVSUlNESndNa1F4YmtKYWRWZGFVZz09/2wlyJIaCuyb/DLM16QSCRXIsoig6YQpWjR479tY8vM=,yahoo.com,86,1,2020-08-26 07:51:19 PM,50,0.5,3,6,448140|446120|454113
    VVhnMWFVaFhNbWRRU21oa09HazRaQT098M+44dupuzateOzZgtW1siqZcCn97z5P30cDiljVA54=,gmail.com,86,1,2020-10-04 04:36:33 PM,41,5,112,19,

The import file has a line count of about 1 million records.

The config file is pretty simple:

    input {
        tcp {
            port => 5000
        }
        file {
            path => '/var/ao/FileDrop/ecsv/*.csv'
            mode => 'read'
            start_position => 'beginning'
            file_completed_action => 'delete'
        }
    }

    filter {
        csv {
            columns => ['email','domain','send_time','engaged','goobrt','score','frequency','open_count','click_count','codes']
            separator => ','
            skip_header => "true"
        }

    #   if ([goobrt] != "n/a") {
            date {
                match => ["goobrt", "yyyy-M-d H:mm:ss a"]
                timezone => "UTC"
                target => "last_touch_utc"
            }
    #   }

        # strip the directory and the .out.csv suffix from the file path
        mutate {
            gsub => [
                "path", "^.*\/", "",
                "path", ".out.csv", ""
            ]
        }

        mutate {
            copy => { "path" => "[@metadata][indexName]" }
        }

        # drop every field that is not in the wanted list
        ruby {
            code => "
                wanted_fields = ['path','email','send_time','engaged','last_touch_utc','score','domain','frequency','open_count','click_count','codes']
                event.to_hash.keys.each { |k|
                    event.remove(k) unless wanted_fields.include? k
                }
            "
        }
    }

    output {
        elasticsearch {
            hosts => 'localhost:9200'
    #       index => "ecsv"
            index => "ecsv-%{[@metadata][indexName]}"
        }
    }

I ran the import in debug mode, which didn't yield a lot of information. I had a lot of these lines:

    [DEBUG] 2020-10-10 18:48:01.639 [[main]>worker1] csv - Running csv filter {:event=>#<LogStash::Event:0x6bac85b6>}
    [DEBUG] 2020-10-10 18:48:01.639 [[main]>worker1] csv - Event after csv filter {:event=>#<LogStash::Event:0x6bac85b6>}

Which, after looking at the Ruby code for the filter, is not concerning at all. I also had these lines in the file:

    [DEBUG] 2020-10-10 18:48:01.510 [[main]<file] file - Received line {:path=>"/var/ao/FileDrop/ecsv/3f0faf5f-1c1a-0a73-4751-2b266073e136.out.csv", :text=>"TlZWbU1VcDRUREk1YWpWcmRrVlBhUT0981FXHh6MObMq+4B85WRP0j+Wze04F+gAjMa+rZ0wCr0=,outlook.com,86,1,2020-06-23 01:57:01 AM,45,0.7,219,12,443142|446120|511210|448120\r"}

Nothing in the log indicated that records were being dropped. There were no errors. But a variable number of records is imported each time from the same file.

Import 1: 440,668
Import 2: 211,840
Import 3: 417,153
Import 4: 740,142

I have tried it with different files and arrived at the same outcome: messages are being dropped. This is a single-node instance with 8 cores. I have tried varying the worker count from 1 to 8 and adjusting the batch size, but messages are still being dropped.

  • The input files are generated in our analytics cluster and as such the format is controlled.
  • The disk is 50% full.
  • 500-ish indices on server

I AM STUMPED and about 24 hours into debugging this thing. I have tried everything I know to check.

Thought: the last field of the CSV is called "codes"; sometimes it has values, sometimes it does not. There does not seem to be a correlation with the records that are kept/dropped.

Help.

I would start by using a file output, see if all 1 million records get as far as the output section.
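Something like this as a second output block (the path is just an example):

    output {
        file {
            # temporary debug output; the default json_lines codec writes one line per event
            path => "/tmp/ecsv-debug.txt"
        }
    }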

Do you have any ecsv-* indexes you do not expect? Could the metadata be wrong?
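You can list them with the cat indices API, e.g. from Kibana Dev Tools:

    GET /_cat/indices/ecsv-*?v&s=index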

OK, that is a good idea regarding the file output.

The indexes are set based on the inbound file name. I did check and I don't see any new "Surprise" indices.

We have crossed file output off the list of possible solutions.

File output had 569,336 lines in it.

I did not run it in debug mode, but I got a similar outcome.

Next I would comment out the ruby filter and see if that fixes the problem.

Ruby code commented out.

647,444 records

Obviously if you are seeing DEBUG messages you could not have missed 300,000 mapping exceptions in the logstash log, but is elasticsearch logging anything of interest? I am about out of suggestions here.

I don't think so, AND this is not new code. It has been running without issue for the last 6-ish months.

This led me to the conclusion that it may be in some way related to the environment. We ramped up usage and then suddenly started to see events drop.

In this thread, we were able to isolate this to a Logstash issue rather than an ES issue.

From an environmental perspective, here is the RAM usage:

                  total        used        free      shared  buff/cache   available
    Mem:           7970        4205         243           7        3522        3462
    Swap:          1023         166         857

Because it is a small machine, we allocated 2 GB to LS:

    ## JVM configuration

    # Xms represents the initial size of total heap space
    # Xmx represents the maximum size of total heap space

    -Xms2g
    -Xmx2g

^--- Does that look correct?

I routinely run logstash with a 400 MB heap. But that is with very simple pipelines. If it ran out of memory I would expect exceptions to be logged.

Are all the records in the csv? Why do you have a tcp input? Could things be backed up enough that even a tcp input is losing data?

Everything is coming in from the CSV file.

The only reason that TCP input is there is because when I first wrote this file a year ago, the example included it. #embarrassed

I just removed it and re-ran

539,416 records

So, while it makes for a cleaner config file, it is not the solution.

I have a pre-processor that I run to encrypt email addresses. I could wipe out the "n/a" issue and then it would be a straight import.

How do I discover the LS backlog?

I just found this thread; while it is older, I have been playing with the batch size (100-3000).

That seems to be a dead end as well.

Could queue type be a factor?

I think that you were right, @Badger. It was the backpressure! I put a 50 ms delay between batches and got the full file import.

1,055,181
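For anyone trying the same thing: the batch delay is tunable in logstash.yml (a sketch; this may not be the only way to add a delay, and the value is in milliseconds):

    # logstash.yml (sketch)
    pipeline.batch.delay: 50   # ms to wait for more events before flushing an undersized batch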

Who has two thumbs and is excited that this particular nightmare is over???

This guy!

It worked once, and now I'm back to the same place.

ARG

OK, so far today:

  • Commented out the entire conf file and still output to a file
  • Turned on the persistent queue (sketched below)
  • Cut the number of workers in half

Still a variable line count in the output file.
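For reference, a sketch of the persistent queue settings I mean (names from the Logstash docs; the size cap is just an example):

    # logstash.yml (sketch)
    queue.type: persisted    # default is "memory"
    queue.max_bytes: 1gb     # example cap on disk usage for the queue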

So, an interesting update: I rewrote the ingestion using the Batch API. I was ingesting a file of 74MM records and it was randomly dropping events. Seems like something other than LS...

To provide a little more detail: I was attempting to import 74MM records through the batch API, and only 60 million showed in the index. I dropped the index and reimported; this time 68 million were imported, but still not 74. When I walk the reply for logging, no errors are reported. I pass all 74MM records in, but the drop still happens.
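By "walk the reply" I mean checking the bulk response's top-level errors flag and each item's status. A rough sketch in Ruby (my importer is not exactly this, but the idea is the same; it assumes the elasticsearch gem):

    require 'elasticsearch'

    # client:  an Elasticsearch::Client, e.g. Elasticsearch::Client.new(url: 'http://localhost:9200')
    # actions: an array of bulk actions, e.g.
    #   [ { index: { _index: 'ecsv-test', data: { email: '...', score: 50 } } }, ... ]
    def log_bulk_failures(client, actions)
      response = client.bulk(body: actions)
      return 0 unless response['errors']        # top-level flag: true if any item failed

      failed = 0
      response['items'].each do |item|
        op = item.values.first                  # e.g. { "status" => 400, "error" => { ... } }
        next unless op['error']
        failed += 1
        warn "bulk item failed: status=#{op['status']} reason=#{op.dig('error', 'reason')}"
      end
      failed                                    # number of items that were rejected
    end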

Because this is happening with both the simplified LS config and the Batch API, it leads me to the conclusion that it is likely a resource-constrained server. The hard disk is 82% full; logs can be trimmed and some indices could be dropped.

Swap usage is at 98%, which is concerning.

LS is no longer running on the box.

There are not a lot of operations running on the box at once.

Help.
