Grok Trying to continue with pattern based on match

Hi, I'm still new to grok and Logstash, and appreciate any help. I'm currently setting up a grok pattern to extract a field from a large message field that contains a lot of human-written text. The data is extremely unstructured, and I am trying to extract the source number from the source id, which looks like S012345.
The pattern I set up:
grok {
  match => {
    "message" => "source id%{GREEDYDATA:name}[sS]%{NUMBER:source_id}"
  }
}
However, in 50% of the cases the extracted source id turns out to be the current index's ID, because the words "source id" sometimes also appear in the paragraphs of text.
My question is: is there any way to insert a condition within grok so that it continues to search for a match if source_id == index_id? I've been looking at the syntax for a while, but couldn't find a similar scenario. Thank you for your help.

Hi Talal,

Some sample data would provide a little clarity.

Where is this index_id field?

Hi NerdSec, Thank you very much for your time.
The index_id field is ingested from the data, while the source_ID will be extracted from the message field. The JSON template for the data includes 3 fields: the Applicant field, the Index_ID field, and the message field. The message field is extremely unstructured throughout.
A sample message field where the pattern makes a mistake would be:
"The specifications statement is identical to the Source_ID device.
RE: S123039
Index_ID S123039
Source_ID Electrical Appliance S120492
Dated October 6, 2016
Received October 7, 2016"
In this case, the source_id is extracted as S123039 because that ID appears after Source_ID is first mentioned in the text. I can't be more specific, as there are many different formats for what comes between Source_ID and S120492 (sometimes a name, sometimes more information). Although the Index_ID is present in this message field, it is also in the ingested data itself. Is it possible to compare the two values and skip the match if they are equal?

I hope this provides some clarity.

Hi Talal,

If you are receiving messages in json, then you should use the json codec in your input.

Or you can use the json filter (see "JSON filter plugin" in the Logstash Reference [8.11]).

Once you have used this filter, the unstructured data in the message field (or whatever the key is named) will be available to you as a separate field. The question is, does that field contain multiline data, as in your sample?

If it is multiline, then you would need to convert it to a single line (for example, by replacing the newlines with mutate's gsub) and then apply grok. Your grok pattern looks fine and, as tested in the Kibana Grok Debugger, it works with a minor tweak. Note that I am making some assumptions here about the message structure, or more precisely about the order in which the fields appear:

%{GREEDYDATA} Index_ID %{GREEDYDATA} Source_ID%{GREEDYDATA}[sS]%{NUMBER:number} %{GREEDYDATA}

Let's take the following log message as example:

{
  "Applicant_field": "bob",
  "Index_ID": "kimchi",
  "message": "The specifications statement is identical to the Source_ID device. RE: S123039 Index_ID S123039 Source_ID Electrical Appliance S120492 Dated October 6, 2016 Received October 7, 2016"
}

The logstash filter that would parse this info would look something as follows:

filter {
  json {
    source => "message"
    target => "parsed"
  }
  grok {
    match => { "[parsed][message]" => "%{GREEDYDATA} Index_ID %{GREEDYDATA} Source_ID%{GREEDYDATA}[sS]%{NUMBER:number} %{GREEDYDATA}"}
  }
}
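
If the parsed message still contains line breaks, one possible way to flatten it before the grok above is a mutate gsub placed between the json and grok filters. This is only a sketch: it assumes the json filter wrote the text to [parsed][message], as in the filter above, and that replacing each run of line breaks with a single space is acceptable for your data.

```conf
filter {
  mutate {
    # gsub takes triples of: field, regular expression, replacement.
    # Assumption: the text to flatten lives in [parsed][message].
    gsub => [ "[parsed][message]", "[\r\n]+", " " ]
  }
}
```

After this step, GREEDYDATA can run across what used to be separate lines, since by default it does not match newline characters.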

In case this does not work, you could have conditionals, where you can check for _grokparsefailure in tags and then apply a second grok filter. Hope this helps.
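
A minimal sketch of that fallback idea is shown here. The second pattern is purely illustrative (a looser match that does not require Index_ID to appear first), so adjust it to whatever alternative layout you actually see in your data.

```conf
filter {
  grok {
    match => { "[parsed][message]" => "%{GREEDYDATA} Index_ID %{GREEDYDATA} Source_ID%{GREEDYDATA}[sS]%{NUMBER:number} %{GREEDYDATA}" }
  }

  # Runs only when the first grok did not match.
  if "_grokparsefailure" in [tags] {
    grok {
      # Hypothetical fallback pattern, not taken from your data.
      match => { "[parsed][message]" => "Source_ID%{GREEDYDATA}[sS]%{NUMBER:number}" }
      # remove_tag is applied only if this second grok succeeds.
      remove_tag => [ "_grokparsefailure" ]
    }
  }
}
```

Events that fail both patterns keep the _grokparsefailure tag, so they can still be found and inspected later.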

PS: Please use formatting to make your posts a little more readable.

Cheers!

Thank you so much for your help! Please let me know if I'm being too troublesome.
This won't work, because sometimes the message field may look like the following:

{
"The specifications statement is identical to the Source_ID device.
RE: S123039
Index_ID S123039
Dated October 6, 2016
Received October 7, 2016"
......TEXT......
Source_ID Electrical Appliance S120492
}

So with this approach, in the 50% of cases where the word 'source_id' is present in the TEXT part, the source_id is extracted as the index_id, because the index ID itself is also mentioned in the text. Do you have an idea of how I could fix this? A simple conditional within grok would work, one that ignores any ID equal to index_id.

I've been trying to work on this problem, but I can't find a way forward. Any help would be appreciated. Thank you so much.

Yes. This too is possible. Use conditionals:

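# Assumes an earlier filter has already stored the full extracted ID (including its S prefix) in [parsed][source_id].
# Field references are case-sensitive, so Index_ID is written as it appears in the JSON.
if [parsed][Index_ID] == [parsed][source_id] {
  grok {
    # The two captures get different names so the second ID does not end up in an array with the first.
    match => { "[parsed][message]" => "%{GREEDYDATA} Index_ID %{GREEDYDATA} Source_ID%{GREEDYDATA}[sS]%{NUMBER:first_number} %{GREEDYDATA} Source_ID%{GREEDYDATA}[sS]%{NUMBER:number}" }
    # Replace any value a previous grok already wrote to "number".
    overwrite => [ "number" ]
    # Drop the fields you don't need.
    remove_field => [ "first_number" ]
  }
}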

Notice the use of two GREEDYDATA to identify the second instance of Source_ID field.

Having multiple GREEDYDATA fields can be really expensive, since it can lead to backtracking and even timeouts. I would anchor the patterns to known parts of the message:

    grok { match => { message => [ "^Index_ID %{WORD:Index_ID}$" ] } }
    grok { match => { message => [ "^Source_ID %{DATA} [Ss]%{INT:Source_ID}$" ] } }
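
Building on the anchored patterns, the comparison asked about earlier could be sketched as follows. This is an illustration under assumptions, not a tested solution: it assumes the ingested ID is already a top-level field named Index_ID, that the message keeps its line breaks (in grok's regex engine ^ and $ match at the start and end of each line), and the tag name is made up for the example. The capture uses a named group so that the S prefix is kept and the two values are directly comparable.

```conf
filter {
  grok {
    # Only a line that begins with Source_ID can match, so a mention of
    # Source_ID in the middle of a sentence is skipped.
    match => { "message" => "^Source_ID %{DATA} (?<Source_ID>[Ss]%{INT})$" }
  }

  # Assumption: [Index_ID] was ingested with the event as a top-level field.
  if [Source_ID] and [Source_ID] == [Index_ID] {
    mutate {
      # Hypothetical tag name, used here to flag the bad extraction.
      add_tag => [ "source_id_equals_index_id" ]
      remove_field => [ "Source_ID" ]
    }
  }
}
```

The comparison is case-sensitive, so an ID written with a lowercase s would not be treated as equal to an Index_ID written with an uppercase S.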
