6.1.2 runs existing ruby filter extremely SLOWER than 5.2.0. Need help!


#1

Hi all,

When I last tried to migrate from 5.1.2 to 6.1.0 I saw an extreme performance drop in production, so I had to roll back.
Currently I am analyzing the issue and trying to build some Rally tracks to find out where it comes from. As part of that, I wanted to anonymize my logs via Logstash so that I can ingest production-like data into a test environment.

While doing this I noticed that processing and anonymizing my log with 6.1.2 is incredibly slow. I read from a file, parse, additionally anonymize, and write to a JSON file.

I ran the same input / filter / output combination in 5.1.2 and 6.1.2:

  • 5.1.2 has a throughput of about 115 MB per minute on my laptop.
  • 6.1.2 has a throughput of about 0.5 MB per minute on the same file, same machine.

Unlike 5.1.2, 6.1.2 offers Logstash monitoring. There I noticed that my ruby script takes about 500 ms per call.

Any idea about this? Are there any breaking changes I have overlooked?

Sure, there may be potential to optimize the filter or the ruby code, but I don't see why LS 6.1.2 is that much slower.

Thanks, Andreas


#2

Some update from my side:
to help you help me, I tried to find out which Logstash version introduced the performance change.

5.6.7 runs fine.
6.1.0, 6.1.2 and 6.2.1 show the same performance issue.
6.0.0 sits in the middle: significantly slower than 5.x, but faster than the 6.1 and 6.2 versions named above.

Now I will add some telemetry inside my ruby script. Maybe I can track down which function became that slow.
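For the record, the telemetry I have in mind is nothing fancy: just wall-clock deltas around each function call, written back into the event. A minimal sketch; the `StubEvent` class is only a stand-in for `LogStash::Event` so the snippet runs on plain Ruby, and the field name is illustrative:

```ruby
# Stand-in for LogStash::Event so the sketch runs outside Logstash.
class StubEvent
  def initialize; @data = {}; end
  def set(field, value); @data[field] = value; end
  def get(field); @data[field]; end
end

# Wrap any function call and record its wall-clock time (ms) in the event.
def timed(event, name)
  start = Time.now
  result = yield
  event.set("[logstash][times][ruby][#{name}]",
            ((Time.now - start) * 1000).round)
  result
end

event = StubEvent.new
timed(event, 'processInterval') { sleep 0.005 }  # pretend work happens here
puts event.get('[logstash][times][ruby][processInterval]')  # prints elapsed ms
```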


(madhur) #3

I would suggest increasing the number of pipeline workers.
You could also try updating the ruby filter and other dependencies.

Hope this helps
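For reference, the worker count can be set on the command line or in logstash.yml; a minimal sketch (the values here are illustrative, tune them for your machine):

```yaml
# logstash.yml -- pipeline tuning knobs (illustrative values)
pipeline.workers: 8       # defaults to the number of CPU cores
pipeline.batch.size: 125  # events per worker batch (125 is the default)
```

The command-line equivalents are `-w 8 -b 125` on `bin/logstash`.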


#4

On my laptop it spawns 8 workers (quad-core + HT).

I will try it, but I wonder why the versions (all running in the default configuration on my laptop) behave so differently.
Also, increasing the pipeline workers may only be a workaround. It does not address the performance loss, which shows up on a per-message basis, not only in throughput.

Is using the ruby filter really so uncommon that no one else has encountered this issue since 6.0 GA?


#5

Hi,

I still have the same issue. In the meantime I anonymized our log, so I can share it with you.
I would like to give you our parsing code, configuration and logfile, and kindly ask you (especially Elastic) to reproduce it.
Hopefully you can identify a bug in Logstash and fix it, or pinpoint an issue in the parsing code that needs to be changed.

I've built a new dev system running CentOS 7.4 with 16 GB RAM and 4 vCPUs on MS Azure.
Before giving you the code and sample to reproduce, let me explain a few things:

  • After input I call a ruby filter to save the current time in the event. At the last possible position before output I get the current time again, so I can calculate the time each event spent in Logstash processing (excluding input and output).

    • [logstash][processing][filterTime] -> time the event spent in processing
    • [logstash][processing][filterEnd] -> time at which the event finished processing
  • I stripped my configuration down to only the necessary steps (only the logfile with the issue; beats input removed, etc.). The problem is the same, but this simplifies reproducing it for you.

  • I instrumented the ruby script to measure the time each function call takes.

    • [logstash][times][ruby][processFetchedCountAndTime]
    • [logstash][times][ruby][processInterval]
    • [logstash][times][ruby][processTM_IDs]
    • [logstash][times][ruby][processLWDR_IDs]
    • [logstash][times][ruby][processLWVD_DATA_CODE]
  • I also measure the complete runtime from entering to leaving the ruby code block.

    • [logstash][times][ruby][completeScript]
  • I write two output files

    • useractionlog.json -> the normal parsed output, which would normally go to the elasticsearch output.
    • useractionlog_debug.json -> lists the processing timestamp, the time spent in Logstash processing, the time spent in each function of the ruby code, and the original message.
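The timestamping from the first bullet can be sketched roughly like this. The field names mirror the post; the `FakeEvent` class is only a stand-in for `LogStash::Event` so the snippet runs outside Logstash, where each `def` body would instead sit in its own ruby filter:

```ruby
# Stand-in for LogStash::Event so the sketch runs outside Logstash.
class FakeEvent
  def initialize; @data = {}; end
  def set(field, value); @data[field] = value; end
  def get(field); @data[field]; end
end

# First ruby filter, right after input: remember when filtering started.
def mark_start(event)
  event.set('[logstash][processing][filterStart]', Time.now.to_f)
end

# Last ruby filter, right before output: compute elapsed milliseconds.
def mark_end(event)
  started = event.get('[logstash][processing][filterStart]')
  event.set('[logstash][processing][filterEnd]', Time.now.utc)
  event.set('[logstash][processing][filterTime]',
            ((Time.now.to_f - started) * 1000).round)
end

e = FakeEvent.new
mark_start(e)
sleep 0.01                      # pretend the other filters run here
mark_end(e)
puts e.get('[logstash][processing][filterTime]')  # prints the elapsed ms
```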

My parsing configuration can be found here:
https://pastebin.com/xw25eb5V

The anonymized logfile which is taken as input can be found here: https://pastebin.com/7JvQCG94
Only the first 100 lines are unique; the rest are duplicates of the first 100 lines.
Pastebin limits files to 512 KB, so I uploaded only the first 500 of the 1000 lines. If you would like to run the same test as me, just append the file to itself once.


#6

The debug output on my dev system looks like the following:

LS 5.2.0:

{"[logstash][processing][filterEnd]":2018-03-02T08:25:29.385Z,"[logstash][processing][filterTime]":"25","[logstash][times][ruby][completeScript]":1,"[logstash][times][ruby][processFetchedCountAndTime]":0,"[logstash][times][ruby][processInterval]":0,"[logstash][times][ruby][processTM_IDs]":0,"[logstash][times][ruby][processLWDR_IDs]":0,"[logstash][times][ruby][processLWVD_DATA_CODE]":0","message":1129728318 1012584202 1x1a322102 1x1 1x303 1x1  02.12.2017 23:00:35 de_de - -----SYSTEMAUTH: Login with UNKNOWN-VIA-SOAP 31.12.9999, GID: 6, UID: 1445, Tuxedo-User: 1012584202, Actual user count: x (1129728318, 11)
}
{"[logstash][processing][filterEnd]":2018-03-02T08:25:29.389Z,"[logstash][processing][filterTime]":"4","[logstash][times][ruby][completeScript]":1,"[logstash][times][ruby][processFetchedCountAndTime]":0,"[logstash][times][ruby][processInterval]":1,"[logstash][times][ruby][processTM_IDs]":0,"[logstash][times][ruby][processLWDR_IDs]":0,"[logstash][times][ruby][processLWVD_DATA_CODE]":0","message":113404088 1964218408 1x1a3376e0 1x1 1x36b 1x1  02.12.2017 23:00:56 de_de - 0.0 RD_DTCO_LIST: XSQL-Filter: '( RD_IS_ARCHIVED = FALSE)', [NO SORT], #0...#9 (PageSize: 10), , Fetched: 0, 0 ms
}
{"[logstash][processing][filterEnd]":2018-03-02T08:25:29.393Z,"[logstash][processing][filterTime]":"3","[logstash][times][ruby][completeScript]":2,"[logstash][times][ruby][processFetchedCountAndTime]":1,"[logstash][times][ruby][processInterval]":0,"[logstash][times][ruby][processTM_IDs]":0,"[logstash][times][ruby][processLWDR_IDs]":1,"[logstash][times][ruby][processLWVD_DATA_CODE]":0","message":113404088 1964218408 1x1a3376e0 1x1 1x36b 1x1  02.12.2017 23:00:56 de_de - 0.0 RD_DTCO_LIST: XSQL-Filter: '( RD_IS_ARCHIVED = FALSE)', [NO SORT], #0...#9 (PageSize: 10), , Fetched: 0, 16 ms
}
{"[logstash][processing][filterEnd]":2018-03-02T08:25:29.412Z,"[logstash][processing][filterTime]":"18","[logstash][times][ruby][completeScript]":10,"[logstash][times][ruby][processFetchedCountAndTime]":0,"[logstash][times][ruby][processInterval]":0,"[logstash][times][ruby][processTM_IDs]":0,"[logstash][times][ruby][processLWDR_IDs]":0,"[logstash][times][ruby][processLWVD_DATA_CODE]":0","message":1155076378 1671413991 1x1a3221b0 1x1 1x22d 1x1  02.12.2017 23:01:13 de_de - 0.0 DB_TMLIST: *FAST MODE*,  [SHARE USER IS OUTSIDE START/END TIMES OR SERVICE CALLED IN FAST MODE - EMPTY LIST] [NO FILTER], [NO SORT], 0...END (PageSize: INFINITE), , Fetched: 0, 0 ms, **NO VEHICLES**, *OK*
}
{"[logstash][processing][filterEnd]":2018-03-02T08:25:29.444Z,"[logstash][processing][filterTime]":"17","[logstash][times][ruby][completeScript]":2,"[logstash][times][ruby][processFetchedCountAndTime]":1,"[logstash][times][ruby][processInterval]":0,"[logstash][times][ruby][processTM_IDs]":0,"[logstash][times][ruby][processLWDR_IDs]":0,"[logstash][times][ruby][processLWVD_DATA_CODE]":-1","message":1155076378 1671413991 1x1a3221b0 1x1 1x22d 1x1  02.12.2017 23:01:13 de_de - 0.0 DB_TMLIST: XSQL-Filter: 'EMPTY', [NO SORT], 0...END (PageSize: INFINITE), , Fetched: 4, 16 ms, <empty shares>, *OK*
}
{"[logstash][processing][filterEnd]":2018-03-02T08:25:29.449Z,"[logstash][processing][filterTime]":"4","[logstash][times][ruby][completeScript]":2,"[logstash][times][ruby][processFetchedCountAndTime]":1,"[logstash][times][ruby][processInterval]":0,"[logstash][times][ruby][processTM_IDs]":0,"[logstash][times][ruby][processLWDR_IDs]":0,"[logstash][times][ruby][processLWVD_DATA_CODE]":0","message":1155076378 1671413991 1x1a3221b0 1x1 1x22d 1x1  02.12.2017 23:01:13 de_de - 0.0 GETUSERPROFILE: UID=xxx, Version-String: <<S API 3.2.0 19.06.2012>> (ok), Version-Date: 2012.06.19, info-text: "xxx.xxx.xxx.xxx via soaps://www.url.de/rpcrouter2.2/soap [SOAP-API 3.9.2]", DTI_ID=xxx, Sees: Vehicles Map Banner Logbook_Preliminary Customer_Share_User, OK.
}
{"[logstash][processing][filterEnd]":2018-03-02T08:25:29.452Z,"[logstash][processing][filterTime]":"3","[logstash][times][ruby][completeScript]":2,"[logstash][times][ruby][processFetchedCountAndTime]":0,"[logstash][times][ruby][processInterval]":0,"[logstash][times][ruby][processTM_IDs]":0,"[logstash][times][ruby][processLWDR_IDs]":1,"[logstash][times][ruby][processLWVD_DATA_CODE]":0","message":1155076378 1671413991 1x1a3221b0 1x1 1x22d 1x1  02.12.2017 23:01:13 de_de - -----SYSTEMAUTH: Login with S API 3.2.0 19.06.2012, GID: 6, UID: 953, Tuxedo-User: 1671413991, Actual user count: x (1155076378, 6)
}
{"[logstash][processing][filterEnd]":2018-03-02T08:25:29.456Z,"[logstash][processing][filterTime]":"3","[logstash][times][ruby][completeScript]":2,"[logstash][times][ruby][processFetchedCountAndTime]":1,"[logstash][times][ruby][processInterval]":0,"[logstash][times][ruby][processTM_IDs]":0,"[logstash][times][ruby][processLWDR_IDs]":0,"[logstash][times][ruby][processLWVD_DATA_CODE]":0","message":1172692764 3762519376 1x1a36f6c1 1x1 1x23f 1x1  02.12.2017 23:00:15 de_de - 0.0 RD_DTCO_LIST: XSQL-Filter: '( RD_IS_ARCHIVED = FALSE)', [NO SORT], #0...#9 (PageSize: 10), , Fetched: 0, 0 ms
}
{"[logstash][processing][filterEnd]":2018-03-02T08:25:29.460Z,"[logstash][processing][filterTime]":"3","[logstash][times][ruby][completeScript]":2,"[logstash][times][ruby][processFetchedCountAndTime]":1,"[logstash][times][ruby][processInterval]":0,"[logstash][times][ruby][processTM_IDs]":0,"[logstash][times][ruby][processLWDR_IDs]":1,"[logstash][times][ruby][processLWVD_DATA_CODE]":0","message":1172692764 3762519376 1x1a36f6c1 1x1 1x23f 1x1  02.12.2017 23:00:15 de_de - 0.0 RD_DTCO_LIST: XSQL-Filter: '( RD_IS_ARCHIVED = FALSE)', [NO SORT], #0...#9 (PageSize: 10), , Fetched: 0, 16 ms
}
{"[logstash][processing][filterEnd]":2018-03-02T08:25:29.464Z,"[logstash][processing][filterTime]":"3","[logstash][times][ruby][completeScript]":2,"[logstash][times][ruby][processFetchedCountAndTime]":1,"[logstash][times][ruby][processInterval]":0,"[logstash][times][ruby][processTM_IDs]":0,"[logstash][times][ruby][processLWDR_IDs]":0,"[logstash][times][ruby][processLWVD_DATA_CODE]":-1","message":1222306988 789323931 1x1a3221ad 1x1 1x210 1x1  02.12.2017 23:01:01 de_de - 0.1 DB_TMLIST: XSQL-Filter: 'TM_DTI_ID IN (xxx)', [NO SORT], #0...#199 (PageSize: 200), , Fetched: 0, 62 ms, **NO VEHICLES**, *OK*
}

So we can see that the processing of each log line in the pipeline takes somewhere between 3 and 25 ms, and the ruby script ([logstash][times][ruby][completeScript]) takes between 1 and 10 ms.

The node statistics of logstash are showing this:
https://pastebin.com/VZtppawN

So the ruby filter's stats say (3244 ms over 2000 events, i.e. about 1.6 ms per event):

      {
        "id": "useractionlog:ruby:1",
        "events": {
          "duration_in_millis": 3244,
          "in": 2000,
          "out": 2000
        },
        "name": "ruby"
      },

#7

LS 6.2.1:

{"[logstash][processing][filterEnd]":2018-03-02T08:33:24.756Z,"[logstash][processing][filterTime]":"31901","[logstash][times][ruby][completeScript]":99,"[logstash][times][ruby][processFetchedCountAndTime]":71,"[logstash][times][ruby][processInterval]":1,"[logstash][times][ruby][processTM_IDs]":10,"[logstash][times][ruby][processLWDR_IDs]":0,"[logstash][times][ruby][processLWVD_DATA_CODE]":-4","message":1007633098 622121154 1x1a31016e 1x1 1x307 1x1  02.12.2017 23:01:26 de_de - 0.0 RD_DTCO_LIST: XSQL-Filter: '( RD_IS_ARCHIVED = FALSE)', [NO SORT], #0...#9 (PageSize: 10), , Fetched: 0, 0 ms
}
{"[logstash][processing][filterEnd]":2018-03-02T08:33:24.757Z,"[logstash][processing][filterTime]":"31830","[logstash][times][ruby][completeScript]":208,"[logstash][times][ruby][processFetchedCountAndTime]":174,"[logstash][times][ruby][processInterval]":19,"[logstash][times][ruby][processTM_IDs]":1,"[logstash][times][ruby][processLWDR_IDs]":0,"[logstash][times][ruby][processLWVD_DATA_CODE]":-12","message":1019639776 246951507 1x1a3221ac 1x1 1x366 1x1  02.12.2017 23:01:02 de_de - 0.0 DB_SHARINGTILST: [NO FILTER], [NO SORT], 0...END (PageSize: INFINITE), , Fetched: 0, 15 ms, ok., ok.
}
{"[logstash][processing][filterEnd]":2018-03-02T08:33:24.757Z,"[logstash][processing][filterTime]":"31830","[logstash][times][ruby][completeScript]":168,"[logstash][times][ruby][processFetchedCountAndTime]":82,"[logstash][times][ruby][processInterval]":16,"[logstash][times][ruby][processTM_IDs]":66,"[logstash][times][ruby][processLWDR_IDs]":0,"[logstash][times][ruby][processLWVD_DATA_CODE]":0","message":1019639776 246951507 1x1a3221ac 1x1 1x366 1x1  02.12.2017 23:01:07 de_de - 0.1 DB_TMLIST: XSQL-Filter: 'TM_DTI_ID IN (xxx)', [NO SORT], 0...END (PageSize: INFINITE), , Fetched: 0, 62 ms, **NO VEHICLES**, *OK*
}
{"[logstash][processing][filterEnd]":2018-03-02T08:33:24.757Z,"[logstash][processing][filterTime]":"31830","[logstash][times][ruby][completeScript]":95,"[logstash][times][ruby][processFetchedCountAndTime]":84,"[logstash][times][ruby][processInterval]":8,"[logstash][times][ruby][processTM_IDs]":0,"[logstash][times][ruby][processLWDR_IDs]":1,"[logstash][times][ruby][processLWVD_DATA_CODE]":0","message":104105070 2585400546 1x1a36b013 1x1 1x22c 1x1  02.12.2017 23:01:10 de_de - 0.0 RD_DTCO_LIST: XSQL-Filter: '( RD_IS_ARCHIVED = FALSE)', [NO SORT], #0...#9 (PageSize: 10), , Fetched: 0, 15 ms
}
{"[logstash][processing][filterEnd]":2018-03-02T08:33:24.757Z,"[logstash][processing][filterTime]":"31829","[logstash][times][ruby][completeScript]":199,"[logstash][times][ruby][processFetchedCountAndTime]":92,"[logstash][times][ruby][processInterval]":88,"[logstash][times][ruby][processTM_IDs]":17,"[logstash][times][ruby][processLWDR_IDs]":1,"[logstash][times][ruby][processLWVD_DATA_CODE]":0","message":1073514545 2437858820 1x1a323ce3 1x1 1x316 1x1  02.12.2017 23:02:02 de_de - 1.0 DB_TMLIST: XSQL-Filter: 'EMPTY', [NO SORT], 0...END (PageSize: INFINITE), , Fetched: 0, 998 ms, **NO VEHICLES**, *OK*
}
{"[logstash][processing][filterEnd]":2018-03-02T08:33:24.757Z,"[logstash][processing][filterTime]":"31806","[logstash][times][ruby][completeScript]":188,"[logstash][times][ruby][processFetchedCountAndTime]":90,"[logstash][times][ruby][processInterval]":80,"[logstash][times][ruby][processTM_IDs]":1,"[logstash][times][ruby][processLWDR_IDs]":7,"[logstash][times][ruby][processLWVD_DATA_CODE]":-6","message":1087807487 2267378791 1x1a33d2f7 1x1 1x362 1x1  02.12.2017 23:00:28 de_de - 0.0 RD_DTCO_LIST: XSQL-Filter: '( RD_IS_ARCHIVED = FALSE)', [NO SORT], #0...#9 (PageSize: 10), , Fetched: 0, 0 ms
}

We can see that each event now takes about 31 seconds (not milliseconds), and the ruby script part takes on the order of 100 to 900 ms within the first lines.

The complete node stats can be found here: https://pastebin.com/eL7hGdUs
The ruby filter mentioned above now takes:

      {
        "id": "useractionlog:ruby:1",
        "events": {
          "duration_in_millis": 535892,
          "in": 2000,
          "out": 2000
        },
        "name": "ruby"
      },

That makes about 268 ms per event on average (535892 ms / 2000 events). But I am also wondering: where does Logstash spend the rest of the 31 seconds?

All of this is very strange.

We are using Logstash in the default configuration from the tar file you offer for download.
The only change to its configuration is that we set both the minimum and maximum heap size to 3 GB.

We are calling logstash using this command:

/usr/local/elastic/logstash/logstash-$LS_VERSION/bin/logstash -f /usr/local/elastic/logstash/conf --path.settings /usr/local/elastic/logstash/logstash-$LS_VERSION/config $options

$LS_VERSION is a script variable that selects which Logstash version to start.
$options is the script's argument; it is empty in this case.

So it's a really odd issue for us, and it prevents us from migrating to 6.x.
Please try to reproduce it, and ask if you need anything from me.

Thanks a lot, Andreas


#8

Really NO reply? Not even a "cannot reproduce"?

I assume we have a Logstash bug here. It would be nice if someone (especially from Elastic) tried to reproduce it and gave some feedback.


#9

During further investigation, it seems that I found something!

Is it possible that the inline-code functionality has become slower from 5.x to 6.x?
I wanted to strip down the code and check where the time is being lost. Because modifying it in Eclipse is easier, I moved the inline code out into an external ruby script. And tada: running the same code as a referenced ruby script via the path option is as fast as in previous versions.

I compared one output event; the output was identical between 5.2.0 and 6.2.2. I will do some further checks, but I assume that moving the script out was done without errors and without losing any processing steps.
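For anyone following along, the two variants look roughly like this (the script path is hypothetical, and the inline code shown is just a placeholder, not my real script):

```conf
filter {
  ruby {
    # Inline variant (the slow one on 6.1+ in my tests):
    # code => "event.set('[logstash][times][ruby][completeScript]', 0)"

    # Referenced-script variant (fast again); path is hypothetical:
    path => "/usr/local/elastic/logstash/scripts/useractionlog.rb"
  }
}
```

With the path option, the referenced script file must define a `filter(event)` function that returns an array of events, plus an optional `register(params)` for setup.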


(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.