Random number of messages dropping on Logstash import

I have Logstash 7.6 running on Ubuntu 20.04. I am importing a pretty simple CSV file. The data in the file is separated by commas. See below:

    email,send_time,engaged,last_touch_utc,score,frequency,open_count,click_count,codes
    VlVSUlNESndNa1F4YmtKYWRWZGFVZz09/2wlyJIaCuyb/DLM16QSCRXIsoig6YQpWjR479tY8vM=,yahoo.com,86,1,2020-08-26 07:51:19 PM,50,0.5,3,6,448140|446120|454113
    VVhnMWFVaFhNbWRRU21oa09HazRaQT098M+44dupuzateOzZgtW1siqZcCn97z5P30cDiljVA54=,gmail.com,86,1,2020-10-04 04:36:33 PM,41,5,112,19,

The import file has a line count of about 1 million records.

The config file is pretty simple:

    input {
        tcp {
            port => 5000
        }
        file {
            path => '/var/ao/FileDrop/ecsv/*.csv'
            mode => 'read'
            start_position => 'beginning'
            file_completed_action => 'delete'
        }
    }

    filter {
        csv {
            columns => ['email','domain','send_time','engaged','goobrt','score','frequency','open_count','click_count','codes']
            separator => ','
            skip_header => "true"
        }

    #   if ([goobrt] != "n/a") {
            date {
                match => ["goobrt", "yyyy-M-d H:mm:ss a"]
                timezone => "UTC"
                target => "last_touch_utc"
            }
    #   }

        # strip the directory and the .out.csv suffix from the file path
        mutate {
            gsub => [
                "path", "^.*\/", "",
                "path", ".out.csv", ""
            ]
        }

        mutate {
            copy => { "path" => "[@metadata][indexName]" }
        }

        # drop every field that is not in the wanted list
        ruby {
            code => "
                wanted_fields = ['path','email','send_time','engaged','last_touch_utc','score','domain','frequency','open_count','click_count','codes']
                event.to_hash.keys.each { |k|
                    event.remove(k) unless wanted_fields.include? k
                }
            "
        }
    }

    output {
        elasticsearch {
            hosts => 'localhost:9200'
    #       index => "ecsv"
            index => "ecsv-%{[@metadata][indexName]}"
        }
    }

I ran the import in debug mode, which didn't yield a lot of information. I had a lot of these lines:

    [DEBUG] 2020-10-10 18:48:01.639 [[main]>worker1] csv - Running csv filter {:event=>#<LogStash::Event:0x6bac85b6>}
    [DEBUG] 2020-10-10 18:48:01.639 [[main]>worker1] csv - Event after csv filter {:event=>#<LogStash::Event:0x6bac85b6>}

Which, after looking at the Ruby code for the filter, is not concerning at all. I also had these lines in the file:

    [DEBUG] 2020-10-10 18:48:01.510 [[main]<file] file - Received line {:path=>"/var/ao/FileDrop/ecsv/3f0faf5f-1c1a-0a73-4751-2b266073e136.out.csv", :text=>"TlZWbU1VcDRUREk1YWpWcmRrVlBhUT0981FXHh6MObMq+4B85WRP0j+Wze04F+gAjMa+rZ0wCr0=,outlook.com,86,1,2020-06-23 01:57:01 AM,45,0.7,219,12,443142|446120|511210|448120\r"}

Nothing in the log indicated that records were being dropped. There were no errors. But a variable number of records is imported each time from the same file.

Import 1: 440,668
Import 2: 211,840
Import 3: 417,153
Import 4: 740,142

I have tried it with different files and arrived at the same outcome: messages are being dropped. This is a single-node instance with 8 cores. I have tried varying the worker count from 1 to 8 and adjusting the batch size, but messages are still being dropped.

  • The input files are generated in our analytics cluster and as such the format is controlled.
  • The disk is 50% full.
  • 500-ish indices on server

I AM STUMPED and about 24 hours into debugging this thing. I have tried everything I know to check.

Thought: the last field of the CSV is called "codes"; sometimes it has values, sometimes it does not. There does not seem to be a correlation with the records that are kept/dropped.

Help.

I would start by using a file output, see if all 1 million records get as far as the output section.
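Something like this as a second output block (the path is just an example):

    output {
        file {
            # temporary debug output; the default json_lines codec writes one line per event
            path => "/tmp/ecsv-debug.txt"
        }
    }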

Do you have any ecsv-* indexes you do not expect? Could the metadata be wrong?
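You can list them with the cat indices API, e.g. from Kibana Dev Tools:

    GET /_cat/indices/ecsv-*?v&s=index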

OK, that is a good idea regarding the file output.

The indexes are set based on the inbound file name. I did check and I don't see any new "Surprise" indices.

We have crossed file output off the list of possible solutions.

File output had 569,336 lines in it.

I did not run it in debug mode, but I got a similar outcome.

Next I would comment out the ruby filter and see if that fixes the problem.

Ruby code commented out.

647,444 records

Obviously if you are seeing DEBUG messages you could not have missed 300,000 mapping exceptions in the logstash log, but is elasticsearch logging anything of interest? I am about out of suggestions here.

I don't think so, AND this is not new code. It has been running without issue for the last 6-ish months.

This led me to the conclusion that it may be in some way related to the environment. We ramped up usage and then suddenly started to see events drop.

In this thread, we were able to isolate this to a Logstash issue rather than an ES issue.

From an environmental perspective, here is the RAM usage:

                  total        used        free      shared  buff/cache   available
    Mem:           7970        4205         243           7        3522        3462
    Swap:          1023         166         857

Because it is a small machine, we allocated 2 GB to LS:

    ## JVM configuration

    # Xms represents the initial size of total heap space
    # Xmx represents the maximum size of total heap space

    -Xms2g
    -Xmx2g

^--- Does that look correct?

I routinely run logstash with a 400 MB heap. But that is with very simple pipelines. If it ran out of memory I would expect exceptions to be logged.

Are all the records in the csv? Why do you have a tcp input? Could things be backed up enough that even a tcp input is losing data?

Everything is coming in from the CSV file.

The only reason that TCP input is there is because when I first wrote this file a year ago, the example included it. #embarrassed

I just removed it and re-ran

539,416 records

So, while it makes for a cleaner config file, it is not the solution.

I have a pre-processor that I run to encrypt email addresses. I could wipe out the "n/a" issue and then it would be a straight import.

How do I discover the LS backlog?

I just found this thread; while it is older, I have been playing with the batch size (100-3000).

That seems to be a dead end as well.

Could queue type be a factor?

I think that you were right, @Badger. It was the backpressure! I put a 50 ms delay between batches and got the full file import.

1,055,181
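For anyone trying the same thing: the batch delay is tunable in logstash.yml (a sketch; this may not be the only way to add a delay, and the value is in milliseconds):

    # logstash.yml (sketch)
    pipeline.batch.delay: 50   # ms to wait for more events before flushing an undersized batch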

Who has two thumbs and is excited that this particular nightmare is over???

This guy!

It worked once, and now I'm back to the same place.

ARG

OK, so far today:

  • Commented out the entire conf file and still output to a file
  • Turned on the persistent queue (sketched below)
  • Cut the number of workers in half

Still a variable line count in the output file.
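For reference, a sketch of the persistent queue settings I mean (names from the Logstash docs; the size cap is just an example):

    # logstash.yml (sketch)
    queue.type: persisted    # default is "memory"
    queue.max_bytes: 1gb     # example cap on disk usage for the queue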

So, an interesting update: I rewrote the ingestion using the Batch API. I was ingesting a file of 74MM records and it was randomly dropping events. Seems like something other than LS...

To provide a little more detail: I was attempting to import 74MM records through the batch API, and only 60 million showed in the index. I dropped the index and reimported; this time 68 million were imported, but still not 74. When I walk the reply for logging, no errors are reported. I pass all 74MM records in, but the drop still happens.
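By "walk the reply" I mean checking the bulk response's top-level errors flag and each item's status. A rough sketch in Ruby (my importer is not exactly this, but the idea is the same; it assumes the elasticsearch gem):

    require 'elasticsearch'

    # client:  an Elasticsearch::Client, e.g. Elasticsearch::Client.new(url: 'http://localhost:9200')
    # actions: an array of bulk actions, e.g.
    #   [ { index: { _index: 'ecsv-test', data: { email: '...', score: 50 } } }, ... ]
    def log_bulk_failures(client, actions)
      response = client.bulk(body: actions)
      return 0 unless response['errors']        # top-level flag: true if any item failed

      failed = 0
      response['items'].each do |item|
        op = item.values.first                  # e.g. { "status" => 400, "error" => { ... } }
        next unless op['error']
        failed += 1
        warn "bulk item failed: status=#{op['status']} reason=#{op.dig('error', 'reason')}"
      end
      failed                                    # number of items that were rejected
    end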

Because this is happening with both the simplified LS config and the Batch API, it leads me to the conclusion that it is likely a resource-constrained server. The hard disk is 82% full; logs can be trimmed and some indices could be dropped.

Swap usage is at 98%, which is concerning.

LS is no longer running on the box.

There are not a lot of operations running on the box at once.

Help.
