I have around 4.5 million records in my Logstash input data. For each record I do a lookup against an existing index in ES using the ES filter plugin. This is just like adding department information based on a user_name field.
After this lookup, I index the complete data into a new index in ES.
If I comment out the ES filter plugin (i.e. without the department information), it takes about 5-6 minutes to load all the input data into Elasticsearch. With the filter plugin enabled, it does not even complete in 40 minutes.
Can the translate filter be an alternative to this? Will it perform better than the ES filter plugin if I look up against a text file rather than already indexed data?
This user_name to department kind of lookup is important for me.
An elasticsearch output makes one API call to elasticsearch for each batch of events. By default the batch size is 125. An elasticsearch filter makes one API call to elasticsearch for each event, so it is making 125 times as many calls. Thus it is not surprising to me that it would take more than 10 times as long.
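To put rough numbers on that, here is a small back-of-the-envelope sketch in Python. It only counts API calls, using the 4.5 million events from the question and the default batch size of 125; the function names are made up for illustration, and it says nothing about how long each call actually takes.

```python
import math


def output_calls(total_events: int, batch_size: int) -> int:
    """API calls made by an output that sends one request per batch."""
    return math.ceil(total_events / batch_size)


def filter_calls(total_events: int) -> int:
    """API calls made by a filter that sends one request per event."""
    return total_events


if __name__ == "__main__":
    total = 4_500_000  # record count from the question
    batch = 125        # default batch size mentioned above

    bulk = output_calls(total, batch)
    per_event = filter_calls(total)

    print(f"output calls: {bulk}")
    print(f"filter calls: {per_event}")
    print(f"ratio: {per_event // bulk}x")
```

So the output needs about 36,000 calls for the whole load, while the filter needs 4.5 million.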
I would expect a translate filter to be very much faster.
I am surprised that you say a translate filter, where the lookup file is stored on local disk, will work faster than an ES query response from the elasticsearch filter plugin. The disk I/O throughput of an ES cluster (due to parallelism) is several times higher than that of the local disk on the Logstash server. Maybe I am wrong here.
Yes, I wasn't aware that the translate filter reads the lookup file into memory. However, could you also tell me how to work around lookup file rollover? Overwriting or updating the lookup on disk could cause issues while the file/inode is being updated. Is there any parameter in the Logstash translate filter that keeps the last in-memory read of the file and refreshes it in memory only on our command?
Just as an example, the steps would be:
The previous lookup file is loaded in Logstash's memory.
The lookup file is updated, replaced or overwritten.
The new file is refreshed into Logstash's memory on some schedule.
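One common way to avoid a reader ever seeing a half-written lookup file is to write the new version to a temporary file in the same directory and then rename it over the old one. A rename within one filesystem is atomic on POSIX systems, so anything that re-reads the path gets either the complete old file or the complete new one. Below is a minimal Python sketch of that idea. It is not Logstash code; the two-column CSV layout and the function names are assumptions for illustration. As far as I know, the translate filter re-reads its dictionary file periodically according to its `refresh_interval` setting, but please check the plugin documentation for your version.

```python
import csv
import os
import tempfile


def replace_lookup_atomically(path, mapping):
    """Write the new lookup to a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be in the same directory (same filesystem) as the
    # target, otherwise the final rename is not guaranteed to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            writer = csv.writer(handle)
            for user_name, department in mapping.items():
                writer.writerow([user_name, department])
            handle.flush()
            os.fsync(handle.fileno())  # make sure the data is on disk first
        os.replace(tmp_path, path)     # swap the new file into place
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_lookup(path):
    """Read the whole lookup into an in-memory dict (user_name -> department)."""
    with open(path, newline="") as handle:
        return {row[0]: row[1] for row in csv.reader(handle) if len(row) >= 2}


if __name__ == "__main__":
    # Hypothetical file name, used only for this demonstration.
    lookup_path = os.path.join(tempfile.mkdtemp(), "departments.csv")
    replace_lookup_atomically(lookup_path, {"alice": "finance"})
    print(load_lookup(lookup_path))
    replace_lookup_atomically(lookup_path, {"alice": "hr", "bob": "it"})
    print(load_lookup(lookup_path))
```

With this pattern the producer of the lookup file controls when the new version becomes visible, and the consumer only ever refreshes from a complete file.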