Logstash CSV loading: broken duplicate of last row

Hello hello,

I'm quite new to the Elastic Stack, and I'm encountering a problem that even experienced colleagues can't figure out.

Bottom line: my data loads fine, except that almost every time I update the CSV, Logstash loads, in addition to the CSV data, a broken document that starts at a random point in the middle of the last row. So every hour I get 0-3 broken documents like this:

The document consists of 50 fields and expects input from a CSV with 50 columns, the first being the TaskID, which is either a 6-7 digit integer or a string of the form 'TM-####'.
TaskID should never load empty, or with anything other than the value from the TaskID column.
The broken documents almost always come from the last row of the CSV or the one before it. There are no unusual characters involved, and the break isn't always in the same place (it seems to pick a random start character and build the message from there).

Sometimes it loads just an empty document built from the last character of column 49 and column 50 of the last row, so TaskID stays empty and the message is just that fragment:

[screenshot of the broken document]

It shouldn't even be possible to load a row that isn't 50 columns long.

The most common row (without sensitive info): TM-478, 2018-11-25, 2020-06-10, , Continuation of TASK 77212 freeze up sporadically for few seconds, , , , , , ,Diamond - High, , ,Assigned to Support, ,Management Products CFG, , , , , ,0, ,6.0, 96.0, , ,Diamond Americas 1-3, ,568.0 ,568.0 ,Yes, Management, , , , , ,2018, 11, , , ,Management Products, , , , Security Management Products, 0

I have a basic CSV loader configured as follows (company info removed):


input {
  file {
    path => "C:\CFS\ELK\task_raw.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => plain {
      charset => "ISO-8859-1"
    }
  }

  file {
    path => "C:\CFS\ELK\jira_raw.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => plain {
      charset => "ISO-8859-1"
    }
  }
}

filter {
  csv {
    separator => ","
    columns => [50 columns]
  }

  date {
    locale => "en"
    match => ["CreateDate", "YYYY-MM-dd"]
    target => "@timestamp"
  }

  mutate {
    remove_field => [ "column", "column" ]
    convert => { "column1" => "float" }
    convert => { "column2" => "integer" }
    convert => { "column4" => "integer" }
    convert => { "column5" => "integer" }
    convert => { "column6" => "integer" }
    convert => { "column7" => "integer" }
  }

  if "X" in [column] { drop {} }
}

output {
  elasticsearch {
    hosts => "localhost:9400"
    manage_template => false
    index => "taskraw_final_data_3"
    document_type => "taskraw_final_data_3"
    document_id => "id_%{TaskId}"
  }
}

Any ideas about what might be causing the problem, or any "workaround" solutions, would be highly appreciated!

Thank you,

Noam

Is there a newline at the end of the last line of the file?

On Windows, if you do not want the in-memory sincedb persisted across restarts, you should set sincedb_path => "NUL". Also, do not use backslashes in the path option of a file input; use forward slashes.
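
For example, the first file input above could look something like this (a sketch of the same input with only the path and sincedb_path lines changed):

file {
  path => "C:/CFS/ELK/task_raw.csv"
  start_position => "beginning"
  sincedb_path => "NUL"
  codec => plain {
    charset => "ISO-8859-1"
  }
}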

Thanks for the prompt response, Badger!

There is an extra line; it appears when you export a pandas DataFrame to CSV. Could that be the problem? I'll look up a way to remove it.

And noted regarding the forward slashes and sincedb.

Hello,

Removed the extra line, but the problem persists.

Anybody have another idea?

Found the source of the problem: Python pandas' to_csv was still "saving" the CSV while Logstash was already reading it, and that confused Logstash. Changing the script to copy the file into place only once it has finished being written solved the problem.
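
For anyone hitting the same issue, here is a minimal sketch of that kind of fix in Python, assuming a pandas DataFrame named df and the task_raw.csv path from the config above (the helper name and the temp-file handling are just illustrative, not the exact script used here):

import os
import tempfile

import pandas as pd

def write_csv_atomically(df: pd.DataFrame, target_path: str) -> None:
    # Write the full CSV to a temporary file in the same directory first,
    # so the Logstash file input never sees a half-written file.
    target_dir = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(suffix=".csv", dir=target_dir)
    try:
        with os.fdopen(fd, "w", encoding="ISO-8859-1", newline="") as tmp_file:
            # Encoding chosen to match the ISO-8859-1 codec in the Logstash config above.
            df.to_csv(tmp_file, index=False)
        # Swap the finished file into place in one step; unlike os.rename,
        # os.replace overwrites an existing target on Windows as well.
        os.replace(tmp_path, target_path)
    except Exception:
        os.remove(tmp_path)
        raise

# Hypothetical usage:
# write_csv_atomically(df, "C:/CFS/ELK/task_raw.csv")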
