Update CSV file changes to ELK

Hello,

If I have a CSV file that adds additional rows every 24 hours, is there a way that I can have ELK upload only the additional rows? I need a system in which I can keep changing the CSV file without ELK re-uploading everything in that file from the start.

Thank you!

Are the new rows being appended to the file, or is the file being replaced with a new file that has more lines?

If the former, then the file input plugin should work for you just fine without any special configuration; it keeps track of the file, its inode, and how far it has read in order to avoid re-emitting lines that it has already processed.

The file is being replaced with another file with more lines in it.

I don't understand what you mean by the former option. If I edit a CSV using the vim command line and add rows to it, isn't that the same as the entire file being replaced?

Thanks

How can I test the former method out with the file input plugin?

It depends on how vim is configured; this answer may be helpful.

You can concatenate data onto the tail of a file's existing inode on the command line using the double shovel operator (>>); suppose you have a data.csv containing all of your existing data, and a new-lines.csv that contains some new lines:

cat new-lines.csv >> data.csv

Caveat: in the above scenario, both files must have a trailing newline in order for this to work repeatably.
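
If you're not sure whether a file ends with a newline, a quick shell check can add one only when it is missing (a minimal sketch assuming a POSIX shell; tail -c 1 prints the file's last byte, and the command substitution drops it if it is a newline):

# append a newline only if data.csv does not already end with one
[ -n "$(tail -c 1 data.csv)" ] && echo >> data.csv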


When testing new things, I tend to use the stdout output plugin using the rubydebug codec (which splats out all of the fields in an event in a somewhat human-readable format), which enables me to see what Logstash is doing.

The following pipeline configuration uses the File Input to read lines from a file, the CSV Filter to extract the CSV data, and an STDOUT Output with a RubyDebug Codec to output events to stdout:

input {
  file {
    path => "/path/to/your/data.csv"
    start_position => "beginning"
  }
}

filter {
  csv {
    # ...
  }
  # add other filters here to mutate your data
}

output {
  stdout {
    codec => rubydebug
  }
}

Then I would start up Logstash and leave it running in a screen session or a separate console tab; at this point I should observe it process the existing lines and wait for more. Then, I would append lines to the data.csv using the method I wanted to test and observe whether Logstash processed only the new lines or if it re-processed the old lines.
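
Concretely, the test loop might look like this (the paths to the Logstash install, the pipeline file, and the CSV files are assumptions for illustration):

# terminal 1: run Logstash with the pipeline above and leave it running
bin/logstash -f /path/to/pipeline.conf

# terminal 2: append the new rows, then watch terminal 1 for new events
cat new-lines.csv >> data.csv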


Warning: the File input keeps track of where it left off across restarts using a persistent checkpoint file called a sincedb (relevant docs here).
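
One consequence is that restarting Logstash against the same file will not re-emit lines already recorded in the sincedb. For throwaway testing you can point the file input's sincedb_path at the null device so nothing persists between runs (a sketch only; the path is an assumption, and on Windows you would use "NUL" instead of /dev/null):

input {
  file {
    path           => "/path/to/your/data.csv"
    start_position => "beginning"
    # discard the checkpoint so every restart re-reads the file from the top
    sincedb_path   => "/dev/null"
  }
}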

Thank you so much.
If the file is being replaced/overwritten, is there a way I can achieve the same results?

If there is no solution in Logstash, you can do something like this.

Save your last file as data.csv.old1, and then when you get the new data.csv, delete all the records that are already present in the old file.

Use sed/awk etc. to remove from the new CSV file everything that is present in the old file; the leftover file contains just the new entries.
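
For example, something like the following could keep only the rows of the new file that do not appear in the old one (a sketch assuming both are plain-text files small enough for grep to hold the old file's lines in memory; the filenames follow the previous post):

# keep only lines of the new data.csv that are not present in data.csv.old1
# -v invert match, -x whole-line match, -F fixed strings, -f patterns from file
grep -vxFf data.csv.old1 data.csv > new-lines.csv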

It depends. If you are using an output that is capable of deduping on id (e.g., Elasticsearch, JDBC), you can use the fingerprint filter plugin to generate a consistent id from each line. Logstash would then reprocess the lines from before the edit, but your data store would be able to ensure that you don't end up with duplicates.
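
A sketch of what that could look like with an Elasticsearch output (the hosts and index name are placeholders, and depending on your fingerprint plugin version you may also need to set a key for the SHA methods):

filter {
  fingerprint {
    # hash the raw CSV line so the same line always produces the same id
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}

output {
  elasticsearch {
    hosts       => ["http://localhost:9200"]
    index       => "csv-data"
    # re-ingesting an already-seen line overwrites the same document
    document_id => "%{[@metadata][fingerprint]}"
  }
}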

You can use the fingerprint filter, or you can create your own document_id to avoid duplicates.

For example, for one of my indices I created this unique id,
where the combination of projectname, systemtypeid, and username will be unique forever.

filter {
  mutate {
    add_field => {
      "doc_id" => "%{projectname}%{systemtypeid}%{username}"
    }
  }
}
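
You would then hand that field to the output as the document id; a sketch, with the hosts and index name as placeholders:

output {
  elasticsearch {
    hosts       => ["http://localhost:9200"]
    index       => "projects"
    # documents with the same doc_id are overwritten rather than duplicated
    document_id => "%{doc_id}"
  }
}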
