Hi everyone, we need some advice.
We have an index fed from a Kafka data source. Our goal is to enrich these events in real time with reference data that lives in Elasticsearch and is updated once a day.
To do this, we built a Logstash pipeline that looks up the reference data with the elasticsearch filter, and we put the memcached filter in front of it as a cache, to speed things up and avoid unnecessary queries to Elasticsearch.
Below is an example of the structure used for one of the enrichments:
# enrichment from the REGION registry data:
# we use the COD_REGN field as the correlation key
# and retrieve the DESC_REGN field
memcached {
  id    => "memcached_get_DESC_REGN"
  hosts => ["ip:port"]
  get   => { "%{COD_REGN}_DESC_REGN" => "DESC_REGN" }
}
if ![DESC_REGN] {
  elasticsearch {
    hosts          => ["https://hostname:9200"]
    index          => "index_name"
    query          => 'COD_REGN: "%{[COD_REGN]}"'
    result_size    => 1
    fields         => { "DESC_REGN" => "DESC_REGN" }
    user           => "user"
    password       => "PWD"
    ca_file        => "/path/to/file.pem"
    tag_on_failure => ["failed_anagrafica_region"]
  }
  memcached {
    hosts => ["ip:port"]
    set   => { "DESC_REGN" => "%{COD_REGN}_DESC_REGN" }
  }
}
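One thing we noticed while reviewing this: if the elasticsearch lookup fails, the memcached set still runs even though DESC_REGN is missing. Since the registry data only changes once a day, a guarded set with an expiry might help (a sketch, assuming the memcached filter's ttl option; 86400 seconds is a placeholder for 24 hours):

# cache only successful lookups, and let entries expire after a day
if [DESC_REGN] {
  memcached {
    hosts => ["ip:port"]
    ttl   => 86400
    set   => { "DESC_REGN" => "%{COD_REGN}_DESC_REGN" }
  }
}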
Is there anything else we can do to optimize the pipeline? We currently have a delay of about 3 hours, due to the large number of queries to Elasticsearch and the volume of data.
The Elastic Stack version we are using is 7.2.
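In case it is relevant, these are the kinds of pipeline settings we would also look at tuning (a sketch with placeholder values, not our production configuration, assuming the standard pipelines.yml options):

# pipelines.yml -- placeholder values, for illustration only
- pipeline.id: "enrichment"
  path.config: "/etc/logstash/conf.d/enrichment.conf"
  pipeline.workers: 8
  pipeline.batch.size: 500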
Thank you!!
Rosanna