If I have a CSV file that gains additional rows every 24 hours, is there a way that I can have ELK upload only the additional rows? I need a setup in which I can keep changing the CSV file and not have ELK re-upload everything in that file from the start.
Are the new rows being appended to the file, or is the file being replaced with a new file that has more lines?
If the former, then the file input plugin should work for you just fine without any special configuration; it keeps track of the file, its inode, and how far it has read in order to avoid re-emitting lines that it has already processed.
The file is being replaced with another file with more lines in it.
I don't understand what you mean by the former option. If I edit a CSV using the vim command line and add rows to it. Isn't that the same as the entire file being replaced?
It depends on how vim is configured; this answer may be helpful.
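For what it's worth, vim's backupcopy option controls whether a save overwrites the file in place (keeping the same inode, which is what the file input tracks) or writes a new file and renames it over the old one (new inode). A minimal sketch, assuming vim here; check :help backupcopy for your own setup:
# with backupcopy=yes, vim overwrites the original file in place on :w,
# so the inode stays the same and rows appended at the end are picked up as new lines
echo 'set backupcopy=yes' >> ~/.vimrc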
You can append data onto the tail of a file's existing inode on the command line using the double shovel operator (>>); suppose you have a data.csv containing all of your existing data and a new-lines.csv that contains some new lines:
cat new-lines.csv >> data.csv
Caveat: in the above scenario, both files must end with a trailing newline for this to work repeatably.
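To confirm that the append really reuses the same inode rather than creating a new file, you can compare inode numbers before and after; a quick check (file names are just the ones from the example above):
ls -i data.csv                  # note the inode number
cat new-lines.csv >> data.csv   # append the new rows
ls -i data.csv                  # should print the same inode number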
When testing new things, I tend to use the stdout output plugin with the rubydebug codec (which splats out all of the fields in an event in a somewhat human-readable format), which lets me see what Logstash is doing.
The following pipeline configuration uses the File Input to read lines from a file, the CSV Filter to extract the CSV data, and an STDOUT Output with a RubyDebug Codec to output events to stdout:
input {
  file {
    path => "/path/to/your/data.csv"
    start_position => "beginning"
  }
}
filter {
  csv {
    # ...
  }
  # add other filters here to mutate your data
}
output {
  stdout {
    codec => rubydebug
  }
}
Then I would start up Logstash and leave it running in a screen session or a separate console tab; at this point I should observe it process the existing lines and wait for more. Then, I would append lines to the data.csv using the method I wanted to test and observe whether Logstash processed only the new lines or if it re-processed the old lines.
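Concretely, such a test run might look like the sketch below; the config file name and the appended row are made up for illustration, and the path to the logstash binary depends on how you installed it:
# terminal 1: run Logstash with the pipeline above and leave it running
bin/logstash -f csv-test.conf
# terminal 2: append one row, then watch terminal 1 for exactly one new event
echo 'some,new,row' >> /path/to/your/data.csv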
Warning: the File input keeps track of where it left off across restarts using a persistent checkpoint file called a sincedb (relevant docs here).
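If you want the sincedb somewhere predictable (for example, so you can delete it while testing to force a full re-read), the file input accepts a sincedb_path option; a minimal sketch, with an illustrative path:
input {
  file {
    path => "/path/to/your/data.csv"
    start_position => "beginning"
    # checkpoint file; delete it (with Logstash stopped) to re-read from the beginning
    sincedb_path => "/path/to/your/data.sincedb"
  }
}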
It depends. If you are using an output that is capable of deduping on id (e.g., Elasticsearch, JDBC), you can use the fingerprint filter plugin to generate a consistent id from each line. Logstash would then reprocess the lines from before the edit, but your data store would be able to ensure that you don't end up with duplicates.
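As a rough sketch of that approach (the field, index, and host below are assumptions, not something from this thread): the fingerprint filter hashes the raw line into a metadata field, and the Elasticsearch output uses that hash as the document id, so re-processed lines overwrite themselves instead of creating duplicates:
filter {
  fingerprint {
    # the same input line always produces the same id
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA256"   # some versions want a `key` (HMAC) here; MURMUR3 also works
  }
}
output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "csv-data"
    document_id => "%{[@metadata][fingerprint]}"
  }
}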