Processing large amounts of data with Logstash

Hi,

Sorry for the long post, but I tried to detail my problem as much as possible to make it clear.

I am part of a medical devices team. We run hundreds of tests on the devices to ensure quality, and we save the results in a .txt file.

In that file, each line represents a campaign of tests, with information about the campaign, the tests, the result of each test, and the comment for each test as well.

To give you an idea, it looks something like this (each ";" represents a new field):
* Line 1: Name_campaign; Timestamp; PC_host; OS; IP; PixDyn_version; Test1; Result1; Comment1; Test2; Result2; Comment2
* Line 2: Name_campaign; Timestamp; PC_host; OS; IP; PixDyn_version; Test1; Result1; Comment1; Test2; Result2; Comment2; Test3; Result3; Comment3; Test4; Result4; Comment4

As you can notice, the number of tests can vary from line to line. Hoping to make my life easier, I created a .py script to modify the file. The script finds the line with the maximum number of fields (X) and pads every other line with empty strings ('') up to X. That way, I could use just one grok filter to match every occurrence instead of having one filter per line.
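The idea of the script, as a minimal sketch (shown in Ruby to match the rest of the thread; my real script is the .py above, and the file names here are made up):

# read the file and split each line into fields, keeping trailing empty fields
lines = File.readlines("results.txt").map { |l| l.chomp.split(";", -1) }
max_fields = lines.map(&:length).max
# pad every line with empty fields up to the longest line
padded = lines.map { |fields| fields + [""] * (max_fields - fields.length) }
File.write("results_padded.txt", padded.map { |f| f.join(";") }.join("\n") + "\n")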

My grok filter is a bizarre thing that looks like this:

grok {
  patterns_dir => ["/patterns"]
  match => { "message" => [ "^%{NUMBER:VersionTableauStat};(?<SessionName>%{YEAR}\_%{MONTHNUM}\_%{MONTHDAY}\__%{HOUR}\_%{MINUTE}\_%{SECOND});%{HOSTNAME:PCName};%{CISCO_REASON:OS};%{IP:Host_DLL};%{IP:NIOS};%{IP:FPGA1};%{IP:FPGA2};%{IP:PULL_DLL};%{WORD:SN_PU};%{WORD:SN_Détecteur};%{WORD:SignOn};(?<PULB_PN>%{AD_TYPE:AD};%{WORD:Test_name};%{RESULT_TEST:Result};%{COMMENT_TEST:Comment})" ] }
}

That is the short version because, for example, in one project I can have 150 tests, i.e., the pattern %{WORD:Test_name};%{RESULT_TEST:Result};%{COMMENT_TEST:Comment} is replicated 150 times. Before anyone asks: I created regexes for the last two patterns and they work (they are not the problem).

With that grok I was expecting the fields "Test_name", "Result" and "Comment" to hold the names of all the tests I ran in one campaign, their results and their comments respectively. That way I could use Kibana or Grafana to visualize which tests failed in a campaign and to monitor it in real time.

And the grok works as long as I do not have a large number of tests. When I run it over a campaign that has 20 tests, for example, the grok matches. But when I have a huge number of tests, like 150, the event gets tagged "_groktimeout". To avoid that, I set "timeout_millis => 0", and the file has now been processing for 2:30 hours with no results yet.
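I set it like this (from what I understand, 0 disables the timeout entirely):

grok {
  # same patterns_dir and match as above
  timeout_millis => 0
}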

My question is: is there a way to make the process faster, to optimize the filter, or an easier solution?

Thank you for reading it. :kissing_closed_eyes:

How are the RESULT_TEST and COMMENT_TEST patterns defined?

They are defined as follows:

RESULT_TEST \s*[0-9]*
COMMENT_TEST \s*[-\a-zA-Z0-9àâäéèëîïôùûüÿ\(\)\_]*

With RESULT_TEST I am expecting to read an empty string or numbers, and with COMMENT_TEST I am expecting to read an empty string or letters, including the accented characters used in French, plus some special characters such as (, ), - and _.

I haven't tested the performance, but for simplicity I'd ignore your patterns, trust the semicolon as a delimiter, and loop over the repeating fields with Ruby.

dissect {
  mapping => { "message" => "%{Name_campaign}; %{Timestamp}; %{PC_host}; %{OS}; %{IP}; %{PixDyn_version}; %{tests_string}" }
}
ruby {
  code => '
    # tests_string holds repeating "name; result; comment" triplets
    tests = []
    event.get("tests_string").split("; ").each_slice(3) { |name, result, comment|
      tests << { "name" => name, "result" => result, "comment" => comment }
    }
    event.set("tests", tests)
  '
}
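With the two-test sample line from the first post, tests should end up like this:

"tests" => [
  { "name" => "Test1", "result" => "Result1", "comment" => "Comment1" },
  { "name" => "Test2", "result" => "Result2", "comment" => "Comment2" }
]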

I agree with Jenni. Use dissect to parse the first part of the line, then use ruby. You could use the .scan method of the String class:

    ruby {
        code => '
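            # scan repeating "name; result; comment" triplets out of tests_string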
            s = event.get("tests_string")
            if s
                event.set("matches", s.scan(/\s*([^;]+); ([^;]+); ([^;]+)(;|$)/))
            end
        '
    }

which will result in a variable-length array such as

       "matches" => [
    [0] [
        [0] "Test1",
        [1] "Result1",
        [2] "Comment1",
        [3] ";"
    ],
    [1] [
        [0] "Test2",
        [1] "Result2",
        [2] "Comment2",
        [3] ""
    ]
],

You will likely want to iterate over the array and reformat the data.
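For example, something like this (a sketch that builds the same per-test hashes as Jenni's version):

    ruby {
        code => '
            s = event.get("tests_string")
            if s
                # each match is [name, result, comment, separator]; drop the separator
                tests = s.scan(/\s*([^;]+); ([^;]+); ([^;]+)(;|$)/).map { |name, result, comment, _sep|
                    { "name" => name, "result" => result, "comment" => comment }
                }
                event.set("tests", tests)
            end
        '
    }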


Thank you, Badger and Jenni.

Both solutions worked, and they are much faster than my idea (less than 7 minutes to send the whole file).

:slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.