Splitting a huge field into multiple fields

Hi All,

I am getting data in JSON format in my Elasticsearch, but one of the fields, called "messages", has over 30K lines. I need to split the lines in "message" into separate fields based on the format. There is a special character, ">", and the data between two occurrences of it should go into a different field.
example data
{
"_index": "jenkins-2019.01.08",
"_type": "doc",
"_id": "50ytLmgBIs_AS77pvVhP",
"_version": 1,
"_score": null,
"_source": {
"@version": 1,
"source_host": "http://jenkins.xx.com",
"@timestamp": "2019-01-08T18:14:11.289Z",
"message": [
"Branch event",
"Started by GitLab push by XXXXXXX",
"retrieving head-revision for us31096-at-Gate3-24hr",
"Obtained Jenkinsfile from bbe820800b0199e6e2a63e3e3ae2a8100b84e111",
"Running in level: PERFORMANCE_OPTIMIZED",
"Loading library dsl@1234",
"Attempting to resolve v3.4.5 from mote references...",
" > git --version # timeout=10",
"using GIT_ASKPASS to set credenti Read Repository",
" > git ls-remote -h https://gitlab.xxxxxx.xxxx.xxxx. # timeout=10",
"Could not find v2.3.4 in remote references. Pulling heads to local for deep search...",
" > git rev-parse --is-inside-work-tree # timeout=10",
"Setting origin to https://gitlab.xxx.xxx.xxxx.xx.git",
" > git config remote.origin.url https://gitlab-xxxx-xxxx-xxxx-xx..git # timeout=10",
"Fetching origin...",
"Fetching upstream changes from origin",
" > git --version # timeout=10",
" > git config --get remote.origin.url # timeout=10",

This is just sample data. We can see ">" in it. Can we split the data between each ">" into multiple fields?

Not sure what you are asking. In your sample data message appears to be an array of strings, but you say the field is called messages. If you do not specify the problem correctly any offered solution will not work. Are you trying to extract a subset of those strings into another field?

If you want to extract the lines that start with " >" then this would work

ruby {
    code => '
        # select returns a new array; keep_if would modify the array in place
        a = event.get("messages").select { |x| x.start_with?(" >") }
        event.set("commands", a)
    '
}

which would get you

  "commands" => [
    [0] " > git --version # timeout=10",
    [1] " > git ls-remote -h https://gitlab.xxxxxx.xxxx.xxxx. # timeout=10",
    [2] " > git rev-parse --is-inside-work-tree # timeout=10",
    [3] " > git config remote.origin.url https://gitlab-xxxx-xxxx-xxxx-xx..git # timeout=10",
    [4] " > git --version # timeout=10",
    [5] " > git config --get remote.origin.url # timeout=10"
]

So... for that sample data, what other fields do you actually want added to the event?
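The filtering step can also be tried outside Logstash. Below is a minimal plain-Ruby sketch, assuming the field is an array of strings as in the sample; `split_commands` is a hypothetical helper name, not something Logstash provides.

```ruby
# Hypothetical helper: separate the " >" command lines from everything else.
# Returns two new arrays and leaves the input untouched.
def split_commands(messages)
  return [[], []] unless messages.is_a?(Array)   # field missing or not an array
  messages.partition { |line| line.start_with?(" >") }
end

# Sample lines copied from the example data in this thread.
messages = [
  "Attempting to resolve v3.4.5 from mote references...",
  " > git --version # timeout=10",
  "using GIT_ASKPASS to set credenti Read Repository",
  " > git rev-parse --is-inside-work-tree # timeout=10",
]

commands, other = split_commands(messages)
puts commands
puts other
```

Using partition keeps the non-command lines available as well, in case they are needed later.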

Dear Badger,

Thanks for your input here. Sorry I was not clear before. Here is the situation.

I have a huge field called "messages" that I would like to break into smaller fields: the data between two ">" should go into its own field.
For example, in the sample data above, "message" contains a lot of git commands, each starting with ">", with the output on the following lines. The next git command starts with ">", and its output runs until another ">" is found.

My requirement is that each git command should be a field and the output of that command should be its value.

I hope this is clear.

Are you saying that given the input

"messages": [
"Attempting to resolve v3.4.5 from mote references...",
" > git --version # timeout=10",
"using GIT_ASKPASS to set credenti Read Repository",
" > git ls-remote -h https://gitlab.xxxxxx.xxxx.xxxx. # timeout=10",
"Could not find v2.3.4 in remote references. Pulling heads to local for deep search...",
" > git rev-parse --is-inside-work-tree # timeout=10",
"Setting origin to https://gitlab.xxx.xxx.xxxx.xx.git",
" > git config remote.origin.url https://gitlab-xxxx-xxxx-xxxx-xx..git # timeout=10",
"Fetching origin...",
"Fetching upstream changes from origin",
" > git --version # timeout=10",
" > git config --get remote.origin.url # timeout=10",

you want the git command to be the field name and the output to be the field value. As in

[
" > git --version # timeout=10" :  "using GIT_ASKPASS to set credenti Read Repository",
" > git ls-remote -h https://gitlab.xxxxxx.xxxx.xxxx. # timeout=10": "Could not find v2.3.4 in remote references. Pulling heads to local for deep search...",
" > git rev-parse --is-inside-work-tree # timeout=10": "Setting origin to https://gitlab.xxx.xxx.xxxx.xx.git",
" > git config remote.origin.url https://gitlab-xxxx-xxxx-xxxx-xx..git # timeout=10": "Fetching origin...\nFetching upstream changes from origin",
" > git --version # timeout=10": ""
....]

and so on? If not, can you show in JSON format what output you want?
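To make the mapping described above concrete, here is a rough plain-Ruby sketch that uses `Enumerable#slice_before` to start a new group at every " >" line. `group_by_command` is a hypothetical name, and the sketch assumes every entry is a string.

```ruby
# Hypothetical helper (not part of Logstash): pair each " >" command line
# with the output lines that follow it.
def group_by_command(lines)
  result = {}
  lines.slice_before { |line| line.start_with?(" >") }.each do |chunk|
    head, *rest = chunk
    # The first chunk may hold lines that precede any command; skip it.
    next unless head.start_with?(" >")
    # A repeated command overwrites its earlier entry in this sketch.
    result[head] = rest.join("\n")
  end
  result
end

# Sample lines copied from the example data in this thread.
lines = [
  "Attempting to resolve v3.4.5 from mote references...",
  " > git --version # timeout=10",
  "using GIT_ASKPASS to set credenti Read Repository",
  " > git config remote.origin.url https://gitlab-xxxx-xxxx-xxxx-xx..git # timeout=10",
  "Fetching origin...",
  "Fetching upstream changes from origin",
  " > git config --get remote.origin.url # timeout=10",
]

group_by_command(lines).each do |command, output|
  puts "#{command.inspect} => #{output.inspect}"
end
```

Because a hash cannot hold the same key twice, a command that appears more than once (such as "git --version" in the sample) needs a rule of its own; this sketch simply keeps the last output.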

Yes Badger, that is correct. I need it the same way as you have shown above.

OK, so start off with something like this. If it doesn't do quite what you want then you will have to update it.

ruby {
    code => '
        a = event.get("messages")
        if a then
            # Local variables rather than instance variables, so that
            # nothing carries over from one event to the next.
            oKey = nil
            oValue = ""
            a.each { |x|
                if x.start_with?(" >") then
                    if oKey then
                        # A repeated command collects its outputs in an array.
                        if event.get(oKey) then
                            event.set(oKey, ( [ event.get(oKey) ] << oValue ).flatten)
                        else
                            event.set(oKey, oValue)
                        end
                    else
                        # Lines seen before the first command.
                        event.set("preamble", oValue)
                    end
                    oKey = x
                    oValue = ""
                else
                    if oValue == "" then
                        oValue = x
                    else
                        oValue = oValue + "\n" + x
                    end
                end
            }
            # Store whatever is left after the last line.
            if oKey then
                event.set(oKey, oValue)
            else
                event.set("preamble", oValue)
            end
        end
    '
}

As always, error handling is left as an exercise for the reader. If my Ruby coding style makes your eyeballs bleed then I apologize to your eyeballs.
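As one possible take on that exercise, here is a plain-Ruby sketch of the same loop with a few guards added, so it can be tested outside Logstash. `split_messages` is a hypothetical helper: it returns a hash of field name to value instead of calling `event.set`, and it assumes that non-array input and non-string entries should simply be ignored.

```ruby
# Hypothetical helper mirroring the loop above, with guards added.
def split_messages(messages)
  fields = {}
  return fields unless messages.is_a?(Array)   # missing or unexpected field

  key = nil      # the command currently being collected, if any
  buffer = []    # output lines seen since that command

  # Store the buffered output under the current command (or "preamble").
  flush = lambda do
    value = buffer.join("\n")
    if key.nil?
      fields["preamble"] = value unless buffer.empty?
    elsif fields.key?(key)
      # Repeated command: keep every output in an array.
      fields[key] = [fields[key], value].flatten
    else
      fields[key] = value
    end
  end

  messages.each do |line|
    next unless line.is_a?(String)   # skip nulls and nested values
    if line.start_with?(" >")
      flush.call
      key = line
      buffer = []
    else
      buffer << line
    end
  end
  flush.call
  fields
end

p split_messages(["Branch event", " > git --version # timeout=10", "using GIT_ASKPASS"])
```

Inside the filter, the returned pairs could then be written to the event one by one with `event.set`.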

Thanks Badger for your support. I was able to implement the same.
