I want to store the line number as the document_id, but I noticed that sometimes the numbers are not stored in the right order. I guess it is because of the multiple workers. The question is: is there a way to store the numbers matching the right lines without losing a lot of performance?
input {
  file {
    path => "c:/path/path//"
    start_position => "beginning"
    sincedb_path => "NUL"
    file_completed_action => "log_and_delete"
    file_completed_log_path => "c:/path/log/log.log"
    file_sort_by => "path"
    mode => "read"
  }
}
filter {
  ruby {
    # this single filter instance is shared by all pipeline workers, so the
    # counter is incremented concurrently and events get numbered out of order
    init => '@number = 0'
    code => '@number += 1
             event.set("numLines", @number)'
  }
}
No, that is not possible. To get the line number you would need the file to be processed sequentially, and to do that you need to run the pipeline with just one worker; this may or may not impact the performance of this specific pipeline.
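For reference, a minimal sketch of how to pin the pipeline to a single worker (the config file name is a placeholder):

  # from the command line, for just this run
  bin/logstash -f your_pipeline.conf --pipeline.workers 1

  # or in logstash.yml, which applies to the whole instance
  pipeline.workers: 1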
Depending on what you are consuming, you could preprocess your files to add the line number to each line; that way every line would carry its own number and you could parse the message to extract it.
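A minimal sketch of the parsing side, assuming the preprocessing step prepends the line number and a space to each line (the field names are just examples):

  filter {
    dissect {
      # "42 some log text" -> numLines = "42", logLine = "some log text"
      mapping => { "message" => "%{numLines} %{logLine}" }
    }
  }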
Those are the only two options I have, right? Every day I read a file that may contain logs repeated from previous days, so I want to store that number as document_id so that when a line is repeated it won't be added again.
Could an option be to run two Logstash instances, one for that type of file running with one worker, and the other one with x workers? That would make me lose a lot of performance, right?
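(As a side note, a single Logstash instance can run multiple pipelines, each with its own worker count, via pipelines.yml; a minimal sketch, with placeholder ids and paths:

  # pipelines.yml
  - pipeline.id: line-numbered
    path.config: "c:/logstash/conf/line_numbered.conf"
    pipeline.workers: 1
  - pipeline.id: main
    path.config: "c:/logstash/conf/main.conf"
    pipeline.workers: 4

Only the single-worker pipeline pays the serialization cost; the other keeps its parallelism.)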
If this is the only reason to have the line number, you actually do not need it: you can use the fingerprint filter on some field to create a unique ID and then use that unique ID as the document id of the document.
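A minimal sketch of that approach, assuming the whole line lives in the message field (the hosts and index name are placeholders):

  filter {
    fingerprint {
      source => "message"
      target => "[@metadata][fingerprint]"
      method => "SHA256"
    }
  }
  output {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "my-logs"
      # same line -> same fingerprint -> same _id, so a re-read line
      # overwrites the existing document instead of creating a duplicate
      document_id => "%{[@metadata][fingerprint]}"
    }
  }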
Not sure how this would work, would you store in different indices? If not, how would this help the duplication case? Also, you are reading the file with log_and_delete.
I'm really grateful for your answers; sadly, I think those solutions don't work in this case. Maybe I explained myself badly: the log files may be repeated, but with new lines inserted, until a file reaches the maximum size and another log is created. So we want to store only the new data, not the past data that was already stored. The fingerprint would be nice if the messages were different, but each repeated message is identical to its previous twin.
Our indices are built from the information on the lines we read, not from the name of the file we are reading.
puched, Elasticsearch is not a relational database like MySQL; the focus is on inverted indices, and there is no internal auto-increment id.
Can you show what your data looks like?
I'm not sure I got what the issue is. The fingerprint filter is used when you have the same message or id and want to store only the most recent copy; you can create a fingerprint based on the entire message field, which in your case would be the line you are reading.
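And if already-stored lines should not even be overwritten, the elasticsearch output can be told to create instead of index; a minimal sketch, reusing the fingerprint from above (hosts and index are still placeholders):

  output {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "my-logs"
      document_id => "%{[@metadata][fingerprint]}"
      # "create" fails with a version conflict (409) when the _id already
      # exists, so repeated lines are skipped rather than rewritten
      action => "create"
    }
  }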