Optimize Logstash filter plugin with a million-line dictionary lookup

Hi,

I am implementing data masking based on a dictionary lookup. Currently four dictionary files (~1.2 million lines in total) are referenced to translate my greedy message.

When the transformation runs using the translate plugin (four individual translate filters, one per dictionary), processing each line of the log file takes ~25-30 seconds, which is far too slow.

Can you please advise how to optimize this transformation? I don't want to reinvent the wheel, so I'd appreciate any ideas, or hearing from anyone who has faced the same performance issue. Thank you in advance.

What about putting the data into an index and then using the Elasticsearch filter to query it?
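Something along these lines, with the dictionary stored in an index; the index and field names here are just placeholders, not from your setup:

elasticsearch {
  hosts  => ["localhost:9200"]
  index  => "masking-dictionary"        # placeholder index holding the lookup table
  query  => "original:%{[message]}"     # placeholder field in the index holding the unmasked value
  fields => { "masked" => "message" }   # copy the masked value from the hit over [message]
}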

What does your data and config look like? Can you show some sample dictionary records?

@warkolm: Technically possible, but we cannot do that; sensitive data must not reach the Elasticsearch engine. It has to be masked at the ETL/data-integration layer before being forwarded to Elasticsearch. Thanks.

My config looks like this:

if [message] =~ /\d/ {
  mutate {
    gsub    => ["message", "\d", "#"]
    add_tag => "Masked"
  }
}

translate {
  field           => "message"
  destination     => "message"
  override        => true
  exact           => false
  dictionary_path => "dictionary1.csv"
  fallback        => "NoMatch: %{message}"
}

translate {
  field           => "message"
  destination     => "message"
  override        => true
  exact           => false
  dictionary_path => "dictionary2.csv"
  fallback        => "NoMatch: %{message}"
}

translate {
  field           => "message"
  destination     => "message"
  override        => true
  exact           => false
  dictionary_path => "dictionary3.csv"
  fallback        => "NoMatch: %{message}"
}

translate {
  field           => "message"
  destination     => "message"
  override        => true
  exact           => false
  dictionary_path => "dictionary4.csv"
  fallback        => "NoMatch: %{message}"
}

if "NoMatch" in [message] {
  mutate { gsub => ["message", "NoMatch: ", ""] }
} else {
  if !("Masked" in [tags]) {
    mutate {
      add_tag => "Masked"
    }
  }
}

Sample dictionary (dictionary1.csv):

Christian, C#######n
Jayson, J####n
...

and so on, for thousands of lines.

That sounds like a very expensive brute-force way to address the problem. Is there any pattern to the words/phrases you are masking? If not, I suspect it would be more efficient to create a custom plugin that loads the full dictionary and then processes the message word by word, comparing each word to the dictionary and generating a new, updated message from the matched data. A rough sketch of that idea is below.
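Something along these lines with the built-in ruby filter might be a starting point; the file paths and the word-boundary tokenisation are assumptions based on your sample dictionary, not tested against your data:

ruby {
  init => '
    # Load all dictionaries into a single in-memory hash once, at pipeline startup.
    # Paths are assumed from the thread; adjust to your four files.
    @dict = {}
    ["dictionary1.csv", "dictionary2.csv", "dictionary3.csv", "dictionary4.csv"].each do |path|
      File.foreach(path) do |line|
        key, value = line.chomp.split(",", 2)
        @dict[key.strip] = value.strip if key && value
      end
    end
  '
  code => '
    # Tokenise on word boundaries and replace only the tokens that match the dictionary
    masked = event.get("message").split(/\b/).map { |token| @dict.fetch(token, token) }.join
    event.set("message", masked)
  '
}

The point is that one hash lookup per token should be far cheaper than scanning ~1.2 million entries against every event, as the translate filter with exact => false effectively does.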

We are releasing the memcached filter as part of Logstash 6.6.0 in 3 days' time. You can install it separately on earlier LS versions, though.

bin/logstash-plugin install logstash-filter-memcached

You will have to run a memcached daemon, though, and pre-load it with the KV data; make sure you understand the expiry side of things.
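As a rough sketch, once the daemon is running and pre-loaded with the dictionary entries, the lookup side could look something like this (the host, namespace, and field mapping are placeholders, not from your setup):

memcached {
  hosts     => ["localhost:11211"]
  namespace => "masking"                      # placeholder key prefix used when pre-loading
  get       => { "%{message}" => "message" }  # look up the message; overwrite it on a hit
}

If the key is not found, the destination field should simply be left as it was, so you can tag or fall back afterwards.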

It'd be better if you created a new topic for this question :slight_smile:

Hi Christian, there is no such pattern, as the masking happens within the greedy message. I wrote a Java program that performs the masking, and it does well in terms of performance. I would only need to call it from Ruby as a custom filter plugin. Would this be fine? Any thoughts on this?

I believe there is a new Java API that can be used to create plugins, so you might be able to convert your code into a filter plugin. Not sure how well documented this is or whether it has been finalised. Maybe @guyboertje or someone else from the Logstash team knows?

We are releasing experimental support for plugins written in Java in 6.6.0, in the not too distant future. There will be a specific blog post by Dan Hermann explaining the Java Plugin API.

Hi @warkolm, another approach I can think of is to let the data flow into Elasticsearch, restrict the original field from users, and recreate a masked field using ingest pipeline processors. But I am not sure whether any of the ingest processors Elasticsearch provides supports a lookup against an external dictionary reference. Any idea? Thanks.

Thanks, we can look into this in the future, but we have decided to go to production on version 6.3.

I have a similar use case: I'm trying to use a dictionary with ~19 million lines (700 MB), but I'm still working out how to implement it, since loading it into memory as a .yml file does not seem like a good idea.

Is the memcached filter already available? I saw that 6.6.0 was released today, but I didn't find any information about a memcached filter.


We are still in the process of releasing 6.6.0. Download artifacts are up but docs and blog posts are in progress.

In the meantime...

https://www.elastic.co/guide/en/logstash-versioned-plugins/current/v0.1.1-plugins-filters-memcached.html

The memcached filter plugin is installable now on 6.5.4 etc.


Thanks! I will try that!
