Logstash CSV loading: broken duplicate of last row

Hello hello,

I'm quite new to the Elastic Stack, and I'm encountering a problem that even experienced colleagues can't figure out.

Bottom line: my data loads fine, except that almost every time I update the CSV, Logstash loads, in addition to the CSV data, a broken document that starts at a random point in the middle of the last row. So every hour I get 0-3 broken documents like this:

The document consists of 50 fields and expects input from a CSV with 50 columns, the first being the TaskID, which is either a 6-7 digit integer or a string of the form 'TM-####'.
TaskID should never load empty, or with anything other than the value from the TaskID column.
The broken documents almost always come from the last row of the CSV or the one before it. There are no unusual characters involved, and the break isn't always in the same place (it seems to pick a random start character and build the message from there).

Sometimes it loads just an empty document built from the last character of column 49 and column 50 of the last row, so TaskID stays empty and the message is just that fragment:

[screenshot of the broken document]

It shouldn't even be possible to load a row that isn't 50 columns long.

The most common row (without sensitive info): TM-478, 2018-11-25, 2020-06-10, , Continuation of TASK 77212 freeze up sporadically for few seconds, , , , , , ,Diamond - High, , ,Assigned to Support, ,Management Products CFG, , , , , ,0, ,6.0, 96.0, , ,Diamond Americas 1-3, ,568.0 ,568.0 ,Yes, Management, , , , , ,2018, 11, , , ,Management Products, , , , Security Management Products, 0

I have a basic CSV loader configured as follows (company info removed):


input {
  file {
    path => "C:\CFS\ELK\task_raw.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => plain {
      charset => "ISO-8859-1"
    }
  }

  file {
    path => "C:\CFS\ELK\jira_raw.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => plain {
      charset => "ISO-8859-1"
    }
  }
}

filter {
  csv {
    separator => ","
    columns => [50 columns]
  }

  date {
    locale => "en"
    match => ["CreateDate", "YYYY-MM-dd"]
    target => "@timestamp"
  }

  mutate {
    remove_field => [ "column", "column" ]
    convert => { "column1" => "float" }
    convert => { "column2" => "integer" }
    convert => { "column4" => "integer" }
    convert => { "column5" => "integer" }
    convert => { "column6" => "integer" }
    convert => { "column7" => "integer" }
  }

  if "X" in [column] { drop {} }
}

output {
  elasticsearch {
    hosts => "localhost:9400"
    manage_template => false
    index => "taskraw_final_data_3"
    document_type => "taskraw_final_data_3"
    document_id => "id_%{TaskId}"
  }
}

Any ideas about what might be causing the problem, or any "workaround" solutions, would be highly appreciated!

Thank you,

Noam

Is there a newline at the end of the last line of the file?

On Windows, if you do not want the in-memory sincedb persisted across restarts, you should set sincedb_path => "NUL". Also, do not use backslashes in the path option of a file input; use forward slashes.
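
For example, the first file input above could look something like this (a sketch of the same input with only the path and sincedb_path lines changed):

file {
  path => "C:/CFS/ELK/task_raw.csv"
  start_position => "beginning"
  sincedb_path => "NUL"
  codec => plain {
    charset => "ISO-8859-1"
  }
}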

Thanks for the prompt response, Badger!

There is an extra line; it appears when you export a pandas DataFrame to CSV. Could that be the problem? I'll look up a way to remove it.

And noted regarding the forward slashes and sincedb.

Hello,

Removed the extra line, but the problem persists.

Anybody have another idea?

Found the source of the problem: Python pandas' to_csv was still "saving" the CSV while Logstash was already reading it, and that confused Logstash. Changing the script to copy the file into place only once it has finished being written solved the problem.
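
For anyone hitting the same issue, here is a minimal sketch of that kind of fix in Python, assuming a pandas DataFrame named df and the task_raw.csv path from the config above (the helper name and the temp-file handling are just illustrative, not the exact script used here):

import os
import tempfile

import pandas as pd

def write_csv_atomically(df: pd.DataFrame, target_path: str) -> None:
    # Write the full CSV to a temporary file in the same directory first,
    # so the Logstash file input never sees a half-written file.
    target_dir = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(suffix=".csv", dir=target_dir)
    try:
        with os.fdopen(fd, "w", encoding="ISO-8859-1", newline="") as tmp_file:
            # Encoding chosen to match the ISO-8859-1 codec in the Logstash config above.
            df.to_csv(tmp_file, index=False)
        # Swap the finished file into place in one step; unlike os.rename,
        # os.replace overwrites an existing target on Windows as well.
        os.replace(tmp_path, target_path)
    except Exception:
        os.remove(tmp_path)
        raise

# Hypothetical usage:
# write_csv_atomically(df, "C:/CFS/ELK/task_raw.csv")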
