I am implementing data masking based on dictionary lookups. Currently four dictionary files (~1.2 million lines in total) are referenced to translate my greedy-matched message field.
When the transformation runs using the translate plugin (four separate translate filters, one mapped to each dictionary), processing each line of the log file takes ~25-30 seconds, which is far too slow.
Can you please advise how to optimize this transformation? I don't want to reinvent the wheel, so I'd appreciate any ideas, or pointers from anyone who has faced this performance issue before. Thank you in advance.
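For context, a setup like the one described would typically look something like the sketch below (the paths and field names are assumptions, and option names are as of the 6.x translate plugin). With substring matching against a large dictionary, every event effectively pays a scan over all four dictionaries:

```
filter {
  # One translate filter per dictionary; each event is checked
  # against every dictionary in turn.
  translate {
    field           => "message"
    destination     => "message"
    dictionary_path => "/etc/logstash/dict1.yml"   # hypothetical path
    exact           => false   # substring matching over the message
    override        => true
  }
  # ... three more translate blocks, one per dictionary file
}
```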
@warkolm: Technically possible, but we cannot do that; sensitive data must not reach the Elasticsearch engine. It has to be masked at the ETL/data-integration layer before being forwarded to Elasticsearch. Thanks
That sounds like a very expensive brute-force way to address the problem. Is there any pattern to the words/phrases you are masking? If not, I suspect it would be more efficient to create a custom plugin that loads the full dictionary and then processes the message word by word, comparing each word against the dictionary and generating a new, updated message from the matched data.
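The word-by-word idea can be sketched in plain Ruby (the dictionary contents, file format, and helper names here are made up for illustration). Merging the dictionaries into one hash turns the cost into one O(1) lookup per token, instead of scanning the message once per dictionary entry:

```ruby
# Merge all dictionary files into a single hash so each token needs
# only one O(1) lookup. Assumes "secret,replacement" CSV lines; the
# paths and format are hypothetical.
def load_dictionaries(paths)
  paths.each_with_object({}) do |path, dict|
    File.foreach(path) do |line|
      key, value = line.chomp.split(",", 2)
      dict[key] = value if key && value
    end
  end
end

# Replace every token found in the dictionary; keep everything else
# (including the original whitespace) untouched.
def mask_message(message, dict)
  message.split(/(\s+)/).map { |token| dict.fetch(token, token) }.join
end

dict = { "4111111111111111" => "<CARD>", "john.doe" => "<USER>" }
puts mask_message("login john.doe card 4111111111111111 ok", dict)
# => "login <USER> card <CARD> ok"
```

This only works because the masking is token-based; if masked phrases can span whitespace or appear as substrings of longer tokens, you would need a different matching strategy (e.g. a trie or Aho-Corasick automaton).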
Hi Christian, there is no such pattern, as the masking applies to the entire greedy-matched message. I wrote a Java program to perform the masking, and its performance is good. I only need to call it from Ruby as a custom filter plugin. Would that be fine? Any thoughts on this?
I believe there is a new Java API that can be used to create plugins, so you might be able to convert your code into a filter plugin. I'm not sure how well documented this is or whether it has been finalised. Maybe @guyboertje or someone else from the Logstash team knows?
We are releasing experimental support for plugins written in Java in 6.6.0, which is coming in the not-too-distant future. There will be a dedicated blog post by Dan Hermann explaining the Java plugin API.
Hi @warkolm, another approach I can think of is to let the data flow into Elasticsearch, restrict the original field from users, and recreate a masked field using ingest pipeline processors. But I am not sure whether any of the ingest processors Elasticsearch provides supports lookups against an external dictionary reference. Any idea? Thanks
I have a similar use case: I'm trying to use a dictionary with ~19 million lines (700 MB), but I'm still investigating how to implement it, since loading it into memory as a .yml file does not seem like a good idea.
Is the memcached filter already available? I saw that 6.6.0 was launched today, but I didn't find any information about a memcached filter.
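If the memcached filter (shipped as the separate logstash-filter-memcached plugin) is available for your version, the general idea is to keep the huge dictionary outside the Logstash heap and look values up per event. A hedged sketch, with made-up host, key, and field names:

```
filter {
  # Look up the value of the [user] field as a key in memcached;
  # if a replacement exists, it is written to [masked_user].
  memcached {
    hosts => ["localhost:11211"]
    get   => { "%{user}" => "[masked_user]" }
  }
}
```

You would need a separate job to preload the 19 million dictionary entries into memcached; the trade-off is one network round-trip per lookup instead of a 700 MB in-heap hash.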