How to transform data using Ruby code

Hi,
I need help extracting data from one of the fields of a CSV file.

The header for this data looks like this:

EPS_BEARER_ID-BEARER_QCI-ARP_PL-ARP_PCI-ARP_PVI-GBR_UL-GBR_DL-BEARER_CAUSE-DEFAULT_BEARER_ID
5-7-undefined-undefined-undefined-undefined-undefined-#0(bearer successful)-5|7-5-undefined-undefined-undefined-undefined-undefined-#0(bearer successful)-7;

This is the definition of the particular fields.

A filter is used to clean up the data; at a later stage it strips the "undefined" entries from the whole field:

	mutate {
		gsub => ["message","undefined-",""]
	}

	mutate {
		gsub => ["message","[-]*undefined",""]
	}

Let's assume that the fields, without "undefined", look as shown below in the screenshot.

How can I separate this data so that it is presented according to the given pattern? The "undefined" fields introduce some order here, but where they are not used they waste storage in Elasticsearch. Therefore, I would like to parse the data without them. But how can I do that with Ruby code?

Do you have any idea about the approach?
As a first hint, I thought about using:


filter {
	kv {
		source => "kvtags"
		field_split => "|"
		value_split => "="
	}
}

How can I build an array with a dictionary (the names of the fields) and use it together with the kv filter?

The tricky part is also to build an array that matches each value with a predetermined dictionary:
[ "eps_bearer_id", "bearer_qci", "bearer_casue_default", "bearer_i" ]
Can someone help with the first step in Ruby code?
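As a minimal plain-Ruby sketch of that first step (outside Logstash, using the dictionary above, and assuming `-` never appears inside a value), you could split the field on `|` and zip each block with the field names:

```ruby
# Field names in block order (the dictionary from the question)
BEARER_FIELDS = ["eps_bearer_id", "bearer_qci", "bearer_casue_default", "bearer_i"]

# Split a "bearers"-style value into one hash per block.
def parse_bearers(value)
  value.chomp(";").split("|").map do |block|
    # "-" only separates fields here; it never occurs inside a value
    BEARER_FIELDS.zip(block.split("-")).to_h
  end
end
```

For example, `parse_bearers("7-5-#0(bearer successful)-7")` yields a one-element array whose hash maps `"eps_bearer_id"` to `"7"`.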

I didn't understand what you are trying to do or what your issue is; can you give more details about it?

What is your original message and what do you expect as the output message? Share plain text, not pictures.

What I got is that this is the header of your files:

EPS_BEARER_ID-BEARER_QCI-ARP_PL-ARP_PCI-ARP_PVI-GBR_UL-GBR_DL-BEARER_CAUSE-DEFAULT_BEARER_ID

And this is sample document:

5-7-undefined-undefined-undefined-undefined-undefined-#0(bearer successful)-5|7-5-undefined-undefined-undefined-undefined-undefined-#0(bearer successful)-7;

Your header has 9 columns, but the document you shared has 18; you seem to have a pipe, |, between the two documents. Is it always like that?

It would be better to split those messages into separate events, is this what you want to do?

If you share more sample messages, it will be easier to understand what the issue is.

Let's say I have such a field with data; this field is called "bearer":

7-5-#0(bearer successful)-7|5-7-#0(bearer successful)-5|6-7-#0(bearer successful)-6

I need to match each of the 7-5-#0(bearer successful)-7 entries against the list of names:

[ "eps_bearer_id", "bearer_qci", "bearer_casue_default", "bearer_i" ]

for example:

eps_bearer_id = 7
bearer_qci = 5
bearer_casue_default = #0(bearer successful)
bearer_i = 7

And what about the undefined values in your sample message?

Also, in this message you have those 3 blocks of data separated by pipes. Are these data related?

7-5-#0(bearer successful)-7|5-7-#0(bearer successful)-5|6-7-#0(bearer successful)-6

If the position of the data in each block is always the same, transforming it into fields is pretty easy; you can do that with a dissect filter.

Can you give more context? If the data between the pipes are not related, the best way to deal with them would be to create an event for each one.

So for the sample message you shared:

7-5-#0(bearer successful)-7|5-7-#0(bearer successful)-5|6-7-#0(bearer successful)-6

You would have 3 events:

7-5-#0(bearer successful)-7
5-7-#0(bearer successful)-5
6-7-#0(bearer successful)-6

Is something like this that you are trying to do?

OK, so in my case some documents have no blocks, others have 1, 2 or 3, so the count of these events is not regular. By definition, there may be a maximum of 11 blocks in a document. This means that one document can contain several events. The final result would be achieved by translating each event, i.e. matching it to the field names.

template1

5-7-#0(bearer successful)-5
5-9-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5|6-5-#0(bearer successful)-6
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
6-5-#0(bearer successful)-6|5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5|6-5-#0(bearer successful)-6
5-9-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5|6-5-#0(bearer successful)-6
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
6-5-#0(bearer successful)-6|5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5|6-5-#0(bearer successful)-6
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
6-5-#0(bearer successful)-6|5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
6-5-#0(bearer successful)-6|5-7-#0(bearer successful)-5
6-5-#0(bearer successful)-6|5-7-#0(bearer successful)-5
6-5-#0(bearer successful)-6|5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5|6-5-#0(bearer successful)-6
5-9-#0(bearer successful)-5
5-7-#0(bearer successful)-5|6-5-#0(bearer successful)-6
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
6-5-#0(bearer successful)-6|5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
6-5-#0(bearer successful)-6|5-7-#0(bearer successful)-5
6-5-#0(bearer successful)-6|5-7-#0(bearer successful)-5|7-1-#0(bearer successful)-6
5-7-#0(bearer successful)-5
6-5-#0(bearer successful)-6|5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
7-5-#0(bearer successful)-7|5-7-#0(bearer successful)-5|6-7-#0(bearer successful)-6
6-5-#0(bearer successful)-6|5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5
5-7-#0(bearer successful)-5

expected results for template1:

eps_bearer_id_1 = 7
bearer_qci_1 = 5
bearer_casue_default_1 = #0(bearer successful)
bearer_i_1 = 7

eps_bearer_id_2 = 6
bearer_qci_2 = 5
bearer_casue_default_2 = #0(bearer successful)
bearer_i_2 = 6
[...]
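The numbered field names shown above can be produced from a multi-block value with a small helper (a plain-Ruby sketch; `to_suffixed_fields` is a hypothetical name, not an existing Logstash function):

```ruby
# Flatten "block|block|..." into a single hash with _1, _2, ... suffixes
def to_suffixed_fields(value, names)
  result = {}
  value.split("|").each_with_index do |block, i|
    names.zip(block.split("-")).each do |name, val|
      result["#{name}_#{i + 1}"] = val
    end
  end
  result
end
```

For `"7-5-#0(bearer successful)-7|6-5-#0(bearer successful)-6"` this yields `eps_bearer_id_1 = 7`, `eps_bearer_id_2 = 6`, and so on.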

template2

5-7-2-disabled-enabled-#0(bearer successful)-5
5-5-1-enabled-disabled-#0(bearer successful)-5
5-7-2-disabled-enabled-#1(bearer failed)-5
-
5-7-2-disabled-enabled-#0(bearer successful)-5
5-7-2-disabled-enabled-#0(bearer successful)-5
5-7-2-disabled-enabled-#0(bearer successful)-5
-
5-7-11-disabled-enabled-#0(bearer successful)-5
5-7-2-disabled-enabled-#0(bearer successful)-5
5-7-2-disabled-enabled-#0(bearer successful)-5
5-7-11-disabled-enabled-#0(bearer successful)-5
5-7-2-disabled-enabled-#1(bearer failed)-5
-
-
5-7-2-disabled-enabled-#0(bearer successful)-5
5-9-10-disabled-enabled-#1(bearer failed)-5
-
-
5-7-2-disabled-enabled-#0(bearer successful)-5
5-7-2-disabled-enabled-#0(bearer successful)-5
5-7-2-disabled-enabled-#0(bearer successful)-5
5-7-2-disabled-enabled-#0(bearer successful)-5
5-7-2-disabled-enabled-#0(bearer successful)-5
5-7-2-disabled-enabled-#0(bearer successful)-5
5-7-6-disabled-disabled-#1(bearer failed)-5
-
5-7-2-disabled-enabled-#0(bearer successful)-5
5-7-2-disabled-enabled-#0(bearer successful)-5

If we think in terms of two templates, the position within each block is the same. But we need to distinguish the template at the beginning by the number of "-" signs.

expected results for template2:

eps_bearer_id_1 = 5
bearer_qci_1 = 7
ARP_PL_1 = 2
ARP_PCI_1 = disabled
ARP_PVI_1 = enabled
bearer_casue_default_1 = #0(bearer successful)
bearer_i_1 = 5

eps_bearer_id_2 = 5
bearer_qci_2 = 7
ARP_PL_2 = 2
ARP_PCI_2 = disabled
ARP_PVI_2 = enabled
bearer_casue_default_2 = #0(bearer successful)
bearer_i_2 = 5
[...]
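Since a template1 block has 4 values (3 separating hyphens) and a template2 block has 7 values (6 separating hyphens), the dictionary could be picked by counting `-` signs in a block. A sketch, assuming the `#N(...)` cause text never contains a hyphen:

```ruby
TEMPLATE1 = ["eps_bearer_id", "bearer_qci", "bearer_casue_default", "bearer_i"]
TEMPLATE2 = ["eps_bearer_id", "bearer_qci", "ARP_PL", "ARP_PCI", "ARP_PVI",
             "bearer_casue_default", "bearer_i"]

# Choose the field dictionary by the number of separators in one block
def fields_for(block)
  block.count("-") == TEMPLATE1.length - 1 ? TEMPLATE1 : TEMPLATE2
end
```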

If I have not been clear in some way, please ask.

Now I think I understand a little.

I think everything would be a lot easier if you did not remove the undefined fields; this way you just need one dissect filter to parse your fields, and you can remove the fields that have undefined afterwards.

You seem to have a fixed-size message where the fields will always exist, but sometimes some of them can have an undefined value, so it is far easier to parse the entire message and then remove the undefined fields than to remove the undefined fields first and create a parser for each kind of message.

For example, considering this is your header, which will be also the name of the fields:

EPS_BEARER_ID-BEARER_QCI-ARP_PL-ARP_PCI-ARP_PVI-GBR_UL-GBR_DL-BEARER_CAUSE-DEFAULT_BEARER_ID

Then you have this message, which has 2 events separated by the |

5-7-undefined-undefined-undefined-undefined-undefined-#0(bearer successful)-5|7-5-undefined-undefined-undefined-undefined-undefined-#0(bearer successful)-7;

This pipeline would parse your messages and remove fields with undefined value.

filter {
    # remove the ; at the end
    mutate {
        gsub => ["message",";",""]
    }
    # split the events
    if "|" in [message] {
        mutate {
            split => ["message", "|"]
        }
        split {
            field => "message"
        }
    }
    # drop lines starting with -
    if [message] =~ "^-" {
        drop {}
    }
    # parse messages
    dissect {
        mapping => {
            "message" => "%{eps_bearer_id}-%{bearer_qci}-%{ARP_PL}-%{ARP_PCI}-%{ARP_PVI}-%{GBR_UL}-%{GBR_DL}-%{bearer_cause_default}-%{bearer_id}"
        }
    }
    # blacklist undefined fields
    prune {
        blacklist_values => [ 
            "ARP_PL", "undefined",
            "ARP_PCI", "undefined",
            "ARP_PVI", "undefined",
            "GBR_UL", "undefined",
            "GBR_DL", "undefined"
            ]
    }
}
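The prune step above can also be illustrated in plain Ruby, which may help if the logic ever moves into a `ruby` filter (a sketch on a plain hash; the real Logstash filter operates on the event API, not a hash):

```ruby
# Drop every key whose value is the literal string "undefined"
def drop_undefined(doc)
  doc.reject { |_field, value| value == "undefined" }
end
```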

Yes, I agree with you; it's much easier to match with a pattern when we leave the undefined fields in.

Here is my part of the Logstash code:

filter {
	if [message] =~ "###FILE###" { drop{ } }
	if [message] =~ "###FOOTER###" { drop{ } }
	if [message] =~ "###STATS###" { drop{ } }
	if [message] =~ "event_id;event_result;" { drop{ } }
	if [message] =~ "termination_cause=" { drop{ } }
	if [message] =~ "Created by parse_ebm_log" { drop{ } }
	if [message] =~ "Input file=" { drop{ } }
	if [message] =~ "Number of" { drop{ } }
	if [message] =~ "Processing end time=" { drop{ } }
	if [message] =~ "Processing start time=" { drop{ } }
	if [message] =~ "Total number of events" { drop{ } }
	if [message] =~ /^\s*$/ { drop{ } }

	mutate {
		gsub => ["message","undefined-",""]
	}

	mutate {
		gsub => ["message","[-]*undefined",""]
	}

	mutate {
		gsub => ["message","\"","sp-ch"]
	}

	csv {
		separator => ";"
		columns => ["sgu_file_name","sgu","event_id","event_result","date","time","millisecond","duration","a21_message_type","a_msisdn","access_type","activation_trigger","activation_type","active_timer","age_of_location_estimate","amf_ue_ngap_id","amfi","apn","attach_type","back_off_timer","bearers"]

and then I remove the fields that are empty:

	skip_empty_columns => true

@leandrojmp, do you have any tips for keeping the "undefined" values only for a particular column, for example "bearers"?
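One option for that is to strip `undefined` after the csv parse instead of with a `gsub` on the raw message. A hedged sketch on a plain hash (inside Logstash this would go through the `ruby` filter's event API; `drop_undefined_except` is an illustrative name):

```ruby
# Remove "undefined" values from every column except the ones listed in `keep`
def drop_undefined_except(row, keep)
  row.reject { |field, value| value == "undefined" && !keep.include?(field) }
end
```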

A cosmetic question: if the data is in the sub-column "bearers", should the reference definition look like this?

gsub => ["[message][bearers]",";",""]

filter {
    # remove the ; at the end
    mutate {
        gsub => ["[message][bearers]",";",""]
    }
    # split the events
    if "|" in [message][bearers] {
        mutate {
            split => ["bearers", "|"]
        }
        split {
            field => "bearers"
        }
    }
    # drop lines starting with -
    if [message] =~ "^-" {
        drop {}
    }
    # parse messages
    dissect {
        mapping => {
            "bearers" => "%{eps_bearer_id}-%{bearer_qci}-%{ARP_PL}-%{ARP_PCI}-%{ARP_PVI}-%{GBR_UL}-%{GBR_DL}-%{bearer_cause_default}-%{bearer_id}"
        }
    }
    # blacklist undefined fields
    prune {
        blacklist_values => [ 
            "ARP_PL", "undefined",
            "ARP_PCI", "undefined",
            "ARP_PVI", "undefined",
            "GBR_UL", "undefined",
            "GBR_DL", "undefined"
            ]
    }
}

No, using [message][bearers] in Logstash means that you have a JSON object called message with a field called bearers:

{ 
    "message" : {
        "bearers": "some value"
    }
}

The example you shared is plain text, so your message is in the top-level message field, which is where Logstash stores the event from the input when it receives plain text.

If your source message has anything after the ; or you need this ; for anything, then you should adapt the pipeline I shared.

Yes, @leandrojmp shared a great idea, but this approach generates a lot of documents in Elasticsearch; it would be much better to build arrays and index the parameters. @Badger, could you keep an eye on the case above?
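For the array idea, here is a plain-Ruby sketch of turning the `bearers` value into an array of objects that could be indexed as one field instead of separate documents (the hash stands in for the Logstash event, and `add_bearers_array!` is an illustrative name):

```ruby
# Replace the "bearers" string with an array of hashes, one per bearer block
def add_bearers_array!(event, names)
  blocks = event["bearers"].to_s.chomp(";").split("|")
  event["bearers"] = blocks.map { |b| names.zip(b.split("-")).to_h }
  event
end
```

In Elasticsearch, such an array of objects is typically mapped as a `nested` field if the per-bearer values need to be queried together.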

I would like to bump this topic.