Improving fingerprint filter performance

vasu01 · April 19, 2021, 11:22am

We are using the Logstash fingerprint filter to avoid duplicate data in Elasticsearch, but data ingestion is taking a long time.

For example, a file with ~15,000-20,000 rows takes approximately 2-3 hours to load. Only two fields are listed in the source option of the filter.

Here is my configuration:

fingerprint {
    key => "FINGERPRINT"
    method => "MD5"
    source => ["FIELD_1", "FIELD_2"]
    target => "document_hash_value"
    concatenate_sources => true
}

Is there a way to reduce the ingestion time? Thanks in advance.

Badger · April 19, 2021, 3:00pm

If you are only able to ingest around two events per second, I very much doubt that the problem is in Logstash. Try changing the output from elasticsearch to stdout (for example with the dots codec) and see what throughput you get then. If it is much higher, then the problem is not in the fingerprint filter.
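
As a rough sketch of that test, the output section could be swapped for something like the following while the input and filter sections stay exactly as they are. The dots codec prints one dot per event, so the rate at which dots appear gives a quick feel for throughput without Elasticsearch in the path. This assumes the pipeline currently has a single elasticsearch output that can be commented out temporarily.

```
output {
    # Temporary replacement for the elasticsearch output:
    # the dots codec writes one "." per event to the console.
    stdout { codec => dots }
}
```

If the dots stream by quickly, the time is being spent on the Elasticsearch side rather than in the filter.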

MD5 was deprecated 25 years ago. I would suggest you change that to SHA256 (not SHA1 which has been deprecated for 10 to 15 years, depending on whose recommendations you follow).
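
A minimal sketch of that change, reusing the options from the configuration in the question and changing only the method. The field names are the placeholders from the question, not real field names.

```
fingerprint {
    key => "FINGERPRINT"
    # SHA256 instead of MD5; with a key set, the filter computes a keyed hash (HMAC).
    method => "SHA256"
    source => ["FIELD_1", "FIELD_2"]
    target => "document_hash_value"
    concatenate_sources => true
}
```

The hashes will differ from the MD5 ones, so documents already indexed under the old values will not be recognised as duplicates of newly ingested ones.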

vasu01 · April 20, 2021, 11:08am

Thanks for the inputs. I will try these out!

vasu01 · April 26, 2021, 12:21pm

@Badger It looks like the issue is not with the fingerprint filter. We have an elasticsearch filter plugin defined, which queries Elasticsearch for every record and adds a few fields. The fingerprint filter is quick and not the culprit.

Now, is there any way to improve the performance of the elasticsearch filter plugin? I couldn't find any performance-related options in the docs.

Also, I noticed that the documents are being ingested in batches of 250 events. Which setting controls this batch size?

Badger · April 26, 2021, 5:12pm

pipeline.batch.size is set to 125 by default.

What is the network latency between logstash and elasticsearch?
