Processing large amounts of data with Logstash

Hi,

Sorry for the long post, but I tried to detail my problem as much as possible to make it clear.

I am part of a medical devices team. We run hundreds of tests on the devices to ensure quality, and we save the results in a .txt file.

In that file, each line represents a campaign of tests, with information about the campaign, the tests, the result of each test, and the comment for each test as well.

To give you an idea, it looks something like this (each ";" represents a new field):
* Line 1: Name_campaign; Timestamp; PC_host; OS; IP; PixDyn_version; Test1; Result1; Comment1; Test2; Result2; Comment2
* Line 2: Name_campaign; Timestamp; PC_host; OS; IP; PixDyn_version; Test1; Result1; Comment1; Test2; Result2; Comment2; Test3; Result3; Comment3; Test4; Result4; Comment4

As you can notice, the number of tests can vary from line to line. Hoping to make my life easier, I created a .py script to modify the file. The script finds the line with the maximum number of fields (X) and pads every other line with empty strings ('') up to X. That way, I could use just one grok filter to match every occurrence instead of having one filter per line.
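The idea of the script, as a minimal sketch (shown in Ruby to match the rest of the thread; my real script is the .py above, and the file names here are made up):

# read the file and split each line into fields, keeping trailing empty fields
lines = File.readlines("results.txt").map { |l| l.chomp.split(";", -1) }
max_fields = lines.map(&:length).max
# pad every line with empty fields up to the longest line
padded = lines.map { |fields| fields + [""] * (max_fields - fields.length) }
File.write("results_padded.txt", padded.map { |f| f.join(";") }.join("\n") + "\n")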

My grok filter is a bizarre thing that looks like this:

grok {
  patterns_dir => ["/patterns"]
  match => { "message" => [ "^%{NUMBER:VersionTableauStat};(?<SessionName>%{YEAR}\_%{MONTHNUM}\_%{MONTHDAY}\__%{HOUR}\_%{MINUTE}\_%{SECOND});%{HOSTNAME:PCName};%{CISCO_REASON:OS};%{IP:Host_DLL};%{IP:NIOS};%{IP:FPGA1};%{IP:FPGA2};%{IP:PULL_DLL};%{WORD:SN_PU};%{WORD:SN_Détecteur};%{WORD:SignOn};(?<PULB_PN>%{AD_TYPE:AD};%{WORD:Test_name};%{RESULT_TEST:Result};%{COMMENT_TEST:Comment})" ] }
}

That is the short version because, for example, in one project I can have 150 tests, i.e., the pattern %{WORD:Test_name};%{RESULT_TEST:Result};%{COMMENT_TEST:Comment} is replicated 150 times. Before anyone asks: I created regexes for the last two patterns and they work (they are not the problem).

With that grok I was expecting the fields "Test_name", "Result" and "Comment" to hold the names of all the tests I ran in one campaign, their results and their comments respectively. That way I could use Kibana or Grafana to visualize which tests failed in a campaign and to monitor it in real time.

And the grok works as long as I do not have a large number of tests. When I run it over a campaign that has 20 tests, for example, the grok matches. But when I have a huge number of tests, like 150, the event gets tagged "_groktimeout". To avoid that, I set "timeout_millis => 0", and the file has now been processing for 2:30 hours with no results yet.
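I set it like this (from what I understand, 0 disables the timeout entirely):

grok {
  # same patterns_dir and match as above
  timeout_millis => 0
}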

My question is: is there a way to make the process faster, to optimize the filter, or an easier solution?

Thank you for reading it. :kissing_closed_eyes:

How are the RESULT_TEST and COMMENT_TEST patterns defined?

They are defined as follows:

RESULT_TEST \s*[0-9]*
COMMENT_TEST \s*[-\a-zA-Z0-9àâäéèëîïôùûüÿ\(\)\_]*

With RESULT_TEST I am expecting to read an empty string or numbers, and with COMMENT_TEST I am expecting to read an empty string or letters, including the accented characters used in French, plus some special characters such as (, ), - and _.

I haven't tested the performance, but for simplicity I'd ignore your patterns, trust the semicolon as a delimiter, and loop over the repeating fields with Ruby.

dissect {
  mapping => { "message" => "%{Name_campaign}; %{Timestamp}; %{PC_host}; %{OS}; %{IP}; %{PixDyn_version}; %{tests_string}" }
}
ruby {
  code => '
    # tests_string holds repeating "name; result; comment" triplets
    tests = []
    event.get("tests_string").split("; ").each_slice(3) { |name, result, comment|
      tests << { "name" => name, "result" => result, "comment" => comment }
    }
    event.set("tests", tests)
  '
}
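With the two-test sample line from the first post, tests should end up like this:

"tests" => [
  { "name" => "Test1", "result" => "Result1", "comment" => "Comment1" },
  { "name" => "Test2", "result" => "Result2", "comment" => "Comment2" }
]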

I agree with Jenni. Use dissect to parse the first part of the line, then use ruby. You could use the .scan method of the String class:

    ruby {
        code => '
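            # scan repeating "name; result; comment" triplets out of tests_string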
            s = event.get("tests_string")
            if s
                event.set("matches", s.scan(/\s*([^;]+); ([^;]+); ([^;]+)(;|$)/))
            end
        '
    }

which will result in a variable-length array such as

       "matches" => [
    [0] [
        [0] "Test1",
        [1] "Result1",
        [2] "Comment1",
        [3] ";"
    ],
    [1] [
        [0] "Test2",
        [1] "Result2",
        [2] "Comment2",
        [3] ""
    ]
],

You will likely want to iterate over the array and reformat the data.
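For example, something like this (a sketch that builds the same per-test hashes as Jenni's version):

    ruby {
        code => '
            s = event.get("tests_string")
            if s
                # each match is [name, result, comment, separator]; drop the separator
                tests = s.scan(/\s*([^;]+); ([^;]+); ([^;]+)(;|$)/).map { |name, result, comment, _sep|
                    { "name" => name, "result" => result, "comment" => comment }
                }
                event.set("tests", tests)
            end
        '
    }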


Thank you, Badger and Jenni.

Both solutions worked, and they are much faster than my idea (less than 7 minutes to send the whole file).

:slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.