Huge dictionary in logstash translate filter

I have a huge dictionary file to be used in the Logstash translate filter, around 180k entries. I tried reducing it to around 80k, but I still always get an error when starting the pipeline, as below:

[2020-10-30T22:11:47,115][ERROR][logstash.agent ] Failed to execute action {:id=>:"dump-subsc", :action_type=>LogStash::ConvergeResult::FailedAction, :message=>"Could not execute action: PipelineAction::Create<dump-subsc>, action_result: false", :backtrace=>nil}

The translate filter in my Logstash pipeline is as below:

    translate {
      field => "[imeiTac]"
      destination => "[deviceName]"
      dictionary_path => "/usr/share/logstash/pipeline/imei_tac.csv"
      fallback => "unknown device"
      refresh_interval => 0
    }

Below is a sample of the dictionary entries:

    "01124500","iPhone A1203"
    "01130000","iPhone A1203"
    "01130100","iPhone A1203"
    "01136400","iPhone A1203"
    "01136500","iPhone A1203"
    "01143400","iPhone A1203"
    "01147200","iPhone A1203"
    "01154600","iPhone-A1203"
    "01161200","iPhone 3G A1241"
    "01161300","iPhone 3G A1241"
    "01161400","iPhone 3G A1241"
    "01165400","iPhone-A1203"
    "01171200","iPhone 3G A1241"
    "01171300","iPhone 3G A1241"
    "01171400","iPhone 3G A1241"
    "01174200","iPhone 3G A1241"

I have tried using only around 5k entries in the dictionary; the pipeline works fine and I get the expected result.

However, when I add the fuller dictionary (around 80k entries), the pipeline does not start.

I have tried increasing the JVM heap from 1GB to 2GB to 4GB without success.

According to the Logstash translate filter documentation, the filter has been tested with around 100k dictionary entries.

How can I make my pipeline work with this huge dictionary?

Thanks,

Enable log.level debug and see if you get a more informative error message than that "Failed to execute action".
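For reference, debug logging can be enabled either in logstash.yml or on the command line (the pipeline path below is just a placeholder):

    # logstash.yml
    log.level: debug

    # or at startup
    bin/logstash --log.level=debug -f /usr/share/logstash/pipeline/your_pipeline.conf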


What is the size of the dictionary file? Try increasing the number of entries until it stops working.

Go from 5k to 10k, then 15k, then 20k, and so on until it fails.

I had a similar problem a couple of years ago: I had a huge dictionary and Logstash took too long to start, blocking the pipeline, and the same thing happened during the scheduled refresh.

To solve this, instead of the translate filter I used the memcached filter and stored my dictionary in a memcached server.
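If you want to try the same approach, here is a minimal sketch of loading a two-column CSV like the one above into memcached. This is an illustration only, assuming Python with pymemcache installed and memcached listening on localhost:11211; adjust it to your own setup:

    # load a two-column CSV (key,value) into memcached
    # assumes: pip install pymemcache, memcached running on localhost:11211
    import csv
    from pymemcache.client.base import Client

    client = Client(("localhost", 11211))

    with open("/usr/share/logstash/pipeline/imei_tac.csv", newline="") as f:
        for key, value in csv.reader(f):
            client.set(key, value)  # e.g. "01124500" -> "iPhone A1203"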


Thank you, I found the problem after activating debug. The dictionary file was not sanitized properly; there are some errors in the middle of the file.

[2020-10-31T11:59:46,942][DEBUG][logstash.javapipeline    ][dump-subsc] Pipeline terminated by worker error {:pipeline_id=>"dump-subsc", :exception=>#<LogStash::Filters::Dictionary::DictionaryFileError: Translate: Unclosed quoted field on line 4040. when loading dictionary file at /usr/share/logstash/pipeline/imei_tac.csv>, :backtrace=>["uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/csv.rb:1927:in `block in shift'"

The offending part of the file looks like below:

"35166808","PIXI 4 7" 4G android"
"35168607","DEXP Ixion ES2 5""
"35218207","DEXP Ixion M 4""
"35249807","DEXP Ixion ML2 5""
"35288007","DEXP Ixion ML 4.5""
"35295808","POP4 6" 4G android"
"35296308","POP4 6" 4G android"
"35296408","POP4 6" 4G android"

After fixing it, the pipeline is working.
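For anyone hitting the same error: the fix is to escape the embedded inch marks by doubling the quotes, which is standard CSV escaping. For example, rewriting the first two sample rows above (illustrative, not the exact file contents):

    "35166808","PIXI 4 7"" 4G android"
    "35168607","DEXP Ixion ES2 5"""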

The size of the dictionary is around 4MB. I found the error after activating debug as suggested by @Badger.

I have tried with around 68k entries for now and will increase to the full 180k.

Thanks for your suggestion about memcached @leandrojmp, I will experiment with that later if this translate filter causes slowness / delay in throughput.

You may also consider using an ingest pipeline directly in Elasticsearch for a huge dump (like subscriber info).
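For example, something along these lines with the enrich processor (a rough sketch, assuming the TAC-to-device mapping has been indexed into an imei_tac index with imeiTac and deviceName fields; the policy and pipeline names here are made up):

    PUT _enrich/policy/device-lookup
    {
      "match": {
        "indices": "imei_tac",
        "match_field": "imeiTac",
        "enrich_fields": ["deviceName"]
      }
    }

    POST _enrich/policy/device-lookup/_execute

    PUT _ingest/pipeline/add-device-name
    {
      "processors": [
        {
          "enrich": {
            "policy_name": "device-lookup",
            "field": "imeiTac",
            "target_field": "device"
          }
        }
      ]
    }

The enrich processor then copies the matched dictionary document into the device field of each incoming event.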


Thank you @ylasri

Seems there are many things to be explored 🙂

Yes, it all depends on how frequently the dictionary data will be updated 🙂

Just to update with the solution for this problem: I was able to load the 180k entries into the translate dictionary, however it did not give me consistent results; for many keys that exist in the dictionary, Logstash gave me "unknown device" (the fallback value).

I ended up setting up memcached and using it instead of the dictionary, and it is working perfectly.

I indexed around 4M records, adding one field from memcached, and it took less than 30 minutes.


What do you mean? When would you use translate, and when an ingest pipeline?

The translate filter has a nice refresh_interval option, so that looks good if the dictionary often changes.

@heric Can you share the config needed for this memcached setup, given your dictionary in /usr/share/logstash/pipeline/imei_tac.csv ? Thx!

Hi Peter,

The configuration for memcached is simple; in my case it is shown below. I am mapping imeiTac to deviceName:

    memcached {
      hosts => ["memcached_server_ip"]
      get => {
        "%{imeiTac}" => "[deviceName]"
      }
    }
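Note that with a get block like this the filter only reads from memcached, so the key/value pairs have to be loaded into the server beforehand (for example with a script like the sketch earlier in this thread). A quick spot check from Python, using a key from the sample data above (assumes pymemcache; memcached_server_ip is a placeholder for your server):

    # verify a single key was loaded correctly
    from pymemcache.client.base import Client

    client = Client(("memcached_server_ip", 11211))
    print(client.get("01124500"))  # expect b"iPhone A1203" if the load worked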
