Huge dictionary in logstash translate filter

I have a huge dictionary file to be used in the logstash translate filter, around 180k entries. I tried reducing it to around 80k, but I still always get an error when starting the pipeline, as below:

[2020-10-30T22:11:47,115][ERROR][logstash.agent ] Failed to execute action {:id=>:"dump-subsc", :action_type=>LogStash::ConvergeResult::FailedAction, :message=>"Could not execute action: PipelineAction::Create<dump-subsc>, action_result: false", :backtrace=>nil}

The translate filter in the logstash pipeline is as below:

    translate {
        field => "[imeiTac]"
        destination => "[deviceName]"
        dictionary_path => "/usr/share/logstash/pipeline/imei_tac.csv"
        fallback => "unknown device"
        refresh_interval => 0
    }


Below is a sample of the dictionary entries:

    "01124500","iPhone A1203"
    "01130000","iPhone A1203"
    "01130100","iPhone A1203"
    "01136400","iPhone A1203"
    "01136500","iPhone A1203"
    "01143400","iPhone A1203"
    "01147200","iPhone A1203"
    "01161200","iPhone 3G A1241"
    "01161300","iPhone 3G A1241"
    "01161400","iPhone 3G A1241"
    "01171200","iPhone 3G A1241"
    "01171300","iPhone 3G A1241"
    "01171400","iPhone 3G A1241"
    "01174200","iPhone 3G A1241"

I have tried using only around 5k entries in the dictionary; the pipeline works fine and I get results.

However, when I add the full dictionary (around 80k entries), the pipeline does not start.

I have tried increasing the JVM heap from 1GB to 2GB and then 4GB, without success.

According to logstash translate filter documentation, it has been tested with around 100k dictionary entries.

How can I make my pipeline work with this huge dictionary?


Enable log.level debug and see if you get a more informative error message than that "Failed to execute action".
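In case it helps, here is one way to turn on debug logging (the pipeline file path below is just an example; adjust it to your install):

```shell
# One-off: run Logstash with debug logging from the command line
bin/logstash --log.level=debug -f /usr/share/logstash/pipeline/pipeline.conf

# Or persistently, in config/logstash.yml:
# log.level: debug
```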


What is the size of the dictionary file? Try increasing the number of entries until it stops working.

Go from 5k to 10k, then 15k, 20k until it fails.

I had a similar problem a couple of years ago. I had a huge dictionary and logstash took too long to start, blocking the pipeline, and the same thing happened during the scheduled refresh.

To solve this, instead of the translate filter I used the memcached filter and stored my dictionary in a memcached server.


Thank you, I found the problem after activating debug. The dictionary file was not sanitized properly; there are some errors in the middle of the file.

[2020-10-31T11:59:46,942][DEBUG][logstash.javapipeline    ][dump-subsc] Pipeline terminated by worker error {:pipeline_id=>"dump-subsc", :exception=>#<LogStash::Filters::Dictionary::DictionaryFileError: Translate: Unclosed quoted field on line 4040. when loading dictionary file at /usr/share/logstash/pipeline/imei_tac.csv>, :backtrace=>["uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/csv.rb:1927:in `block in shift'"

The offending entries look like this:

"35166808","PIXI 4 7" 4G android"
"35168607","DEXP Ixion ES2 5""
"35218207","DEXP Ixion M 4""
"35249807","DEXP Ixion ML2 5""
"35288007","DEXP Ixion ML 4.5""
"35295808","POP4 6" 4G android"
"35296308","POP4 6" 4G android"
"35296408","POP4 6" 4G android"

After fixing it, the pipeline is working.

The size of the dictionary is around 4MB. I found the error after activating debug as suggested by @Badger.

I tried with around 68k entries just now and will increase to the full 180k.

Thanks for your suggestion about memcached @leandrojmp; I will experiment with that later if this translate filter causes some slowness / delay in the throughput.

You may also consider using an ingest pipeline directly in elasticsearch for a huge dump (like subscriber info).


Thank you @ylasri

It seems there are many things to be explored :slight_smile:

Yes, it all depends on how frequently the dictionary data will be updated :slight_smile:

Just to update with the solution for this problem: I was able to load 180k entries into the translate dictionary, but it did not give consistent results; for many keys that do exist in the dictionary, logstash still gave me "unknown device" (the fallback value).

I ended up setting up memcached and using it instead of the dictionary, and it is working perfectly.

I indexed around 4M records, adding 1 field from memcached, and it took less than 30min.


What do you mean? When would you use translate and when ingest pipeline?

The translate filter has a nice refresh_interval option, so that looks good if the dictionary often changes.

@heric Can you share the config needed for this memcached, given your dictionary in /usr/share/logstash/pipeline/imei_tac.csv? Thx!

Hi Peter,

The configuration for memcached is simple; in my case it is shown below.
I am mapping imeiTac to deviceName:

      memcached {
          hosts => ["memcached_server_ip"]
          get => {
              "%{imeiTac}" => "[deviceName]"
          }
      }
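To populate the memcached server from the CSV in the first place, one option is to generate ASCII-protocol `set` commands from the dictionary file and pipe them to the server. A minimal sketch (the sample rows are inlined for illustration; in practice you would read the real CSV and the host is a placeholder):

```python
import csv
import io

def to_memcached_sets(csv_text):
    """Turn "tac","device" CSV rows into memcached ASCII-protocol set commands.

    The output can be piped straight to the server, e.g.:
        python load_dict.py < imei_tac.csv | nc memcached_server_ip 11211
    (script name and host above are placeholders).
    """
    cmds = []
    for key, value in csv.reader(io.StringIO(csv_text)):
        payload = value.encode("utf-8")
        # Protocol line: set <key> <flags> <exptime> <bytes>, then the data line.
        cmds.append(f"set {key} 0 0 {len(payload)}\r\n{value}\r\n")
    return "".join(cmds)

sample = '"01124500","iPhone A1203"\n"01161200","iPhone 3G A1241"\n'
print(to_memcached_sets(sample))
```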

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.