Hi, I have a 4-column CSV file that needs to be processed and aggregated by Logstash before being sent to Elasticsearch. In my testing I raised the JVM heap size, set the batch size to 50k, and set the batch delay to 1m, but it still took 1-2 minutes to aggregate 1.5 million rows in a 48 MB file. A simple Python script took only 0.5 seconds. Am I doing anything wrong here?
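
For reference, the Python comparison I have in mind is roughly along these lines (a minimal sketch rather than the exact script; data.csv is a placeholder file name):

import csv
from collections import Counter

# Count rows per (id, azimuth, elevation) combination, the same
# grouping key the aggregate filter below uses.
counts = Counter()
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row_id, abc_id, azimuth, elevation in reader:
        counts[(row_id, azimuth, elevation)] += 1

for (row_id, azimuth, elevation), count in counts.items():
    print(row_id, azimuth, elevation, count)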
Below is my filter config; the input and output are very simple (a rough sketch of them follows the filter config).
filter {
  csv {
    separator => ","
    skip_header => true
    columns => ["id", "abc-id", "azimuth", "elevation"]
    convert => {
      "id" => "integer"
      "azimuth" => "integer"
      "elevation" => "integer"
    }
  }
  aggregate {
    task_id => "%{id}|%{azimuth}|%{elevation}"
    code => "
      map['count'] ||= 0; map['count'] += 1
      map['id'] = event.get('id')
      map['azimuth'] = event.get('azimuth')
      map['elevation'] = event.get('elevation')
    "
    push_map_as_event_on_timeout => true
    timeout => 120  # increased from the original 2 seconds
    timeout_code => "
      event.set('[@metadata][wanted]', true)
    "
  }
  # Drop the raw events, keep only the aggregated ones
  if ![@metadata][wanted] { drop {} }
}
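
For completeness, the input and output blocks look roughly like this (the path, host, and index name are placeholders for my actual values):

input {
  file {
    path => "/path/to/data.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "aggregated-data"
  }
}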