Ruby filter counting error with multiple workers

I want to store the line number as the document_id, but I noticed that sometimes the numbers are not stored in order. I guess it's because of the multiple workers. Is there a way to store the numbers in the right order without losing a lot of performance?
input {
  file {
    path => "c:/path/path//"
    start_position => "beginning"
    sincedb_path => "NUL"
    file_completed_action => "log_and_delete"
    file_completed_log_path => "c:/path/log/log.log"
    file_sort_by => "path"
    mode => "read"
  }
}

filter {
  ruby {
    init => '@number = 0'
    code => '
      @number += 1
      event.set("numLines", @number)
    '
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "index"
    ssl => false
    ilm_enabled => false
    user => 'user'
    password => 'password'
    document_id => "%{numLines}"
  }

  stdout {
    codec => line { format => "%{numLines}" }
  }
}

No, that is not possible. To get the line number you would need the file to be processed sequentially, and to do that you need to run the pipeline with just one worker; this may or may not impact the performance of this specific pipeline.
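For reference, a minimal sketch of how you could pin the pipeline to a single worker from the command line (the config file name is a placeholder):

    bin/logstash -f your_pipeline.conf -w 1

The -w (--pipeline.workers) flag sets the number of worker threads; with a single worker, events pass through the ruby filter in file order, so the counter matches the line number.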

Depending on what you are consuming, you may preprocess your files to add the line number to each line; that way you would have the line number on every line and could parse your message to get it, as in the sketch below.
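As a sketch of the parsing side, assuming each line has been preprocessed to start with its line number followed by a space (field names here are illustrative), a dissect filter could split it back out:

    filter {
      dissect {
        # "42 some log text" -> numLines = "42", rest of the line in logline
        mapping => { "message" => "%{numLines} %{logline}" }
      }
    }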

Are those the only 2 options I have? Every day I read a file that may contain logs repeated from previous days, so I want to store that number as document_id so that a repeated line won't be added again.

Could it be an option to run 2 Logstash instances, one for that type of file running with 1 worker, and the other one with x workers? Or would that make me lose a lot of performance?
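For what it's worth, two separate instances would not strictly be needed: a single Logstash instance can run several pipelines with different worker counts through pipelines.yml. A sketch, with placeholder ids and paths:

    - pipeline.id: line-numbered-files
      path.config: "/etc/logstash/conf.d/lines.conf"
      pipeline.workers: 1
    - pipeline.id: everything-else
      path.config: "/etc/logstash/conf.d/other.conf"
      pipeline.workers: 4

Only the pipeline that needs sequential processing pays the single-worker cost; the others keep their parallelism.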

If this is the only reason to have the line number, you actually do not need it: you can use the fingerprint filter on some field to create a unique ID and then use that unique ID as the document id of the document.
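A minimal sketch of that suggestion (the metadata field name is a placeholder; check the fingerprint filter docs for the available methods):

    filter {
      fingerprint {
        source => "message"                      # hash the whole line
        target => "[@metadata][fingerprint]"     # keep the hash out of the indexed document
        method => "SHA256"
      }
    }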

Check this blog post with some examples.

Not sure how this would work. Would you store the data in different indices? If not, how would this help with the duplication case? Also, you are reading the file with log_and_delete.

I'm really grateful for your answers; sadly, I don't think those solutions work in this case. Maybe I explained myself badly: the log files may be repeated, but with new lines inserted until the file reaches its maximum size and another log is created, so we want to store only the new data, not the past data that was already stored. The fingerprint would be nice if the messages were different, but each repeated message is identical to its previous twin.

Our indices work by taking info from the line we are reading, not from the name of the document we are reading.

puched, ES is not a relational database like MySQL; the focus is on inverted indices, and there is no internal auto-increment id.
Can you show what your data looks like?

Not sure I got what the issue is. The fingerprint filter is used precisely when you have the same message or id and want to store only the most recent copy: you can create a fingerprint based on the entire message field, which in your case would be the line that you are reading.
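Wiring that fingerprint into the output would then look roughly like this (hosts and index are placeholders). Since identical lines hash to the same id, re-reading a file that has grown overwrites the already-indexed copies instead of duplicating them, and only the genuinely new lines create new documents:

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "index"
        document_id => "%{[@metadata][fingerprint]}"   # same line => same _id => overwrite, not duplicate
      }
    }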
