I am implementing data masking based on dictionary lookups. Currently four dictionary files (~1.2 million lines in total) are referenced to translate my greedy-matched message field.
When the transformation runs using the translate plugin (four separate translate filters, one mapped to each dictionary), processing each line of the log file takes ~25-30 seconds, which is far too slow.
Can you please advise how to optimize this transformation? I don't want to reinvent the wheel, so I'd appreciate any ideas, or pointers from anyone who has faced this performance issue before. Thank you in advance.
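For context, a setup like the one described would typically look something like the sketch below (the paths and field names are assumptions, and option names are as of the 6.x translate plugin). With substring matching against a large dictionary, every event effectively pays a scan over all four dictionaries:

```
filter {
  # One translate filter per dictionary; each event is checked
  # against every dictionary in turn.
  translate {
    field           => "message"
    destination     => "message"
    dictionary_path => "/etc/logstash/dict1.yml"   # hypothetical path
    exact           => false   # substring matching over the message
    override        => true
  }
  # ... three more translate blocks, one per dictionary file
}
```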
@warkolm: Technically possible, but we cannot do that; sensitive data must not reach the Elasticsearch engine. It has to be masked at the ETL/data-integration layer before being forwarded to Elasticsearch. Thanks
That sounds like a very expensive brute-force way to address the problem. Is there any pattern to the words/phrases you are masking? If not, I suspect it would be more efficient to create a custom plugin that loads the full dictionary and then processes the message word by word, comparing each word against the dictionary and generating a new, updated message from the matched data.
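The word-by-word idea can be sketched in plain Ruby (the dictionary contents, file format, and helper names here are made up for illustration). Merging the dictionaries into one hash turns the cost into one O(1) lookup per token, instead of scanning the message once per dictionary entry:

```ruby
# Merge all dictionary files into a single hash so each token needs
# only one O(1) lookup. Assumes "secret,replacement" CSV lines; the
# paths and format are hypothetical.
def load_dictionaries(paths)
  paths.each_with_object({}) do |path, dict|
    File.foreach(path) do |line|
      key, value = line.chomp.split(",", 2)
      dict[key] = value if key && value
    end
  end
end

# Replace every token found in the dictionary; keep everything else
# (including the original whitespace) untouched.
def mask_message(message, dict)
  message.split(/(\s+)/).map { |token| dict.fetch(token, token) }.join
end

dict = { "4111111111111111" => "<CARD>", "john.doe" => "<USER>" }
puts mask_message("login john.doe card 4111111111111111 ok", dict)
# => "login <USER> card <CARD> ok"
```

This only works because the masking is token-based; if masked phrases can span whitespace or appear as substrings of longer tokens, you would need a different matching strategy (e.g. a trie or Aho-Corasick automaton).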
Hi Christian, there is no such pattern, as the masking applies to the entire greedy-matched message. I wrote a Java program to perform the masking, and its performance is good. I only need to call it from Ruby as a custom filter plugin. Would that be fine? Any thoughts on this?
I believe there is a new Java API that can be used to create plugins, so you might be able to convert your code into a filter plugin. I'm not sure how well documented this is or whether it has been finalised. Maybe @guyboertje or someone else from the Logstash team knows?
We are releasing experimental support for plugins written in Java in 6.6.0, which is coming in the not-too-distant future. There will be a dedicated blog post by Dan Hermann explaining the Java plugin API.
Hi @warkolm, another approach I can think of is to let the data flow into Elasticsearch, restrict the original field from users, and recreate a masked field using ingest pipeline processors. But I am not sure whether any of the ingest processors Elasticsearch provides supports lookups against an external dictionary reference. Any idea? Thanks
I have a similar use case: I'm trying to use a dictionary with ~19 million lines (700 MB), but I'm still investigating how to implement it, since loading it into memory as a .yml file does not seem like a good idea.
Is the memcached filter already available? I saw that 6.6.0 was launched today, but I didn't find any information about a memcached filter.
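If the memcached filter (shipped as the separate logstash-filter-memcached plugin) is available for your version, the general idea is to keep the huge dictionary outside the Logstash heap and look values up per event. A hedged sketch, with made-up host, key, and field names:

```
filter {
  # Look up the value of the [user] field as a key in memcached;
  # if a replacement exists, it is written to [masked_user].
  memcached {
    hosts => ["localhost:11211"]
    get   => { "%{user}" => "[masked_user]" }
  }
}
```

You would need a separate job to preload the 19 million dictionary entries into memcached; the trade-off is one network round-trip per lookup instead of a 700 MB in-heap hash.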