Finally, I incorporated the Ruby part into the aggregation. Thanks @steffens for the help.
Indexing comparison of the three approaches (Logstash aggregation vs. single-line inserts vs. Filebeat multiline):
* {logstash aggregation} aggreg_test: rows: 3,804,206, size: 1.7 GB, index time: 1.8 h, insert time: 3 h 20 min
* {all lines, no aggregation} single_test: rows: 74,326,466, size: 23.8 GB, index time: 12.4 min, insert time: 2 h 45 min
* {multiline filebeat} multi_test: rows: 3,802,915, size: 2.8 GB, index time: 6.8 min, insert time: 14 min
Aggregation provides the most complete result (everything is indexed) with the least storage, but it keeps Logstash busy by far the longest: more than 3 hours to insert the same file into Elasticsearch, compared to 14 minutes with the multiline option.
Is my aggregation correct? Do you have any idea why it is so slow?
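For scale, the effective line throughput of the aggregation pipeline can be estimated from the numbers above, assuming the same input file (the 74,326,466 lines seen in single_test) feeds all three pipelines:

```ruby
# Rough throughput estimate for the aggregation pipeline.
# Assumes every input line passes through Logstash before being
# aggregated, as the single_test row count suggests.
lines = 74_326_466
insert_seconds = 3 * 3600 + 20 * 60  # 3 h 20 min insert time

throughput = (lines / insert_seconds.to_f).round
puts "#{throughput} lines/second"  # roughly 6,200 lines/second
```

So the aggregate pipeline is processing on the order of a few thousand lines per second, which is the number to compare against the multiline path.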
Aggregation:
filter {
  dissect {
    mapping => { "message" => "%{ID},%{key},%{value}" }
  }
  aggregate {
    task_id => "%{ID}"
    code => "
      map['ID'] = event.get('ID')
      map['profile'] ||= []
      map['profile'] << { event.get('key') => event.get('value') }
      event.cancel()
    "
    push_previous_map_as_event => true
    timeout => 0
    #remove_field => ["key", "value"]
  }
}
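In plain Ruby, the code block accumulates the parsed lines of one task into the shared map like this (a minimal sketch outside Logstash; the events array and its sample values are illustrative stand-ins for the dissected lines):

```ruby
# Sketch of what the aggregate filter's code block does, simulated
# outside Logstash. Each hash in `events` stands for one dissected
# line of the same task (same ID); `map` is the aggregate's shared map.
events = [
  { 'ID' => '3201235446654', 'key' => '37',      'value' => '21' },
  { 'ID' => '3201235446654', 'key' => '2001001', 'value' => '83793503' }
]

map = {}
events.each do |event|
  map['ID'] = event['ID']
  map['profile'] ||= []
  map['profile'] << { event['key'] => event['value'] }
  # event.cancel() would drop the original one-line event here,
  # so only the pushed map survives as a document
end

puts map.inspect
```

When push_previous_map_as_event fires on the next task_id, this accumulated map is emitted as a single event.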
This aggregation produces the following document:
@timestamp  February 6th 2019, 15:42:01.095
@version    1
ID          3201235446654
_id         ea5Cw2gBEQ455fveKMOJ
_index      aggreg_test
_score      -
_type       doc
profile     [
  { "37": "21" },
  { "2001001": "83793503" },
  { "2003000": "0" },
  { "2003001": "1" },
  { "2003002": "0" },
  { "2003003": "2" },
  { "2003004": "4" },
  { "2003005": "4" },
  { "2003006": "5" },
  { "2003007": "5" },
  { "2003008": "4" },
  { "2003009": "3" },
  { "2003010": "3" },
  { "2003011": "3" },
  { "2003012": "3" },
  { "2003013": "2" },
  { "2003014": "2" },
  { "2003015": "1" },
  { "2003017": "0" },
  { "2004305": "A" },
  { "2004306": "20190212154008" },
  { "2005000": "Na_Feature" },
  { "2005001": "ET" },
  { "2005002": "IntNa_Feature" },
  { "2006505": "1812" },
  { "2006506": "72000000" },
  { "2006507": "600000000" },
  { "2020151": "0" },
  { "2020342": "-2147483647" }
]