How to split a given JSON document into multiple events dynamically? Tried answers from various forum questions


(Ritesh) #1

Hi all,
This question has been asked multiple times in the forum, as far as I can see, but there is no definite answer.
I saw that one of the Elastic members answered it nicely, but it is not working for me.

I have JSON data in this format coming from an input:

{"page":1,"per_page":3,"total":12,"total_pages":4,"data":[{"id":1,"first_name":"George","last_name":"Bluth","avatar":"https://s3.amazonaws.com/uifaces/faces/twitter/calebogden/128.jpg"},{"id":2,"first_name":"Janet","last_name":"Weaver","avatar":"https://s3.amazonaws.com/uifaces/faces/twitter/josephstein/128.jpg"},{"id":3,"first_name":"Emma","last_name":"Wong","avatar":"https://s3.amazonaws.com/uifaces/faces/twitter/olegpogodaev/128.jpg"}]}

I am using this filter to split it into multiple events, but it is not splitting:

filter {
  json {
    source => "message"
    target => "[@metadata][data]"
  }
  ruby {
    code => "

      string_indexed = event.get('[@metadata][data]')
      items_array = string_indexed.keys.sort_by(&:to_i).map do |key|
        string_indexed[key]
      end
      event.set('[@metadata][items-array]', items_array)
    "
  }
  split {
    field => "[@metadata][items-array]"
    target => "[@metadata][single-item]"
  }
  mutate {
    rename => {
      "[@metadata][single-item][id]" => "id"
      "[@metadata][single-item][first_name]" => "first_name"
      "[@metadata][single-item][last_name]" => "last_name"
    }
  }
}

In the logs I also get this error:

Ruby exception occurred: undefined method `keys' for nil:NilClass

The above is not splitting into multiple events.
I am looking for Ruby code which creates multiple events dynamically from a JSON dataset.


#2

Why not just do this?

    json { source => "message" }
    split { field => "data" }

undefined method `keys' for nil:NilClass suggests that [@metadata][data] is nil, meaning the json filter did not parse. I suggest removing all the @s from @metadata and seeing what you get in stdout { codec => rubydebug }
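The error itself can be reproduced outside Logstash. This is only a minimal sketch, assuming the json filter failed so that the field lookup returned nil. A plain Hash stands in for the event, and `fields` and `safe_keys` are hypothetical names used for illustration.

```ruby
# A plain Hash stands in for the Logstash event. The parsed field was never
# set because (by assumption) the json filter failed to parse the message.
fields = { 'message' => 'not valid json' }

string_indexed = fields['data'] # nil, like event.get on a missing field

begin
  string_indexed.keys           # same call as in the ruby filter above
rescue NoMethodError => e
  error_message = e.message     # complains that nil has no method keys
end

# Hypothetical guard: return an empty list instead of raising.
def safe_keys(value)
  value.is_a?(Hash) ? value.keys : []
end
```

A guard like this only hides the symptom; the json filter still has to parse the message for there to be anything to split.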

I find that the split fails. items-array looks like this:

  "metadata" => {
    "items-array" => [
        [0] 3,
        [1] 1,
        [2] 12,
        [3] 4,
        [4] [
            [0] {
                        "id" => 1,
                "first_name" => "George",
                    "avatar" => "https://s3.amazonaws.com/uifaces/faces/twitter/calebogden/128.jpg",
                 "last_name" => "Bluth"
            },
            [1] {
                        "id" => 2,
                "first_name" => "Janet",
                    "avatar" => "https://s3.amazonaws.com/uifaces/faces/twitter/josephstein/128.jpg",
                 "last_name" => "Weaver"
            },
            [2] {
                        "id" => 3,
                "first_name" => "Emma",
                    "avatar" => "https://s3.amazonaws.com/uifaces/faces/twitter/olegpogodaev/128.jpg",
                 "last_name" => "Wong"
            }
        ]
    ],

and that gets you "exception"=>"undefined method `empty?' for 3:Fixnum" when it tries to split the first value in the array. That's fixed in the most recent (3.1.7) version of the filter.

Once that is fixed, items-array only has one entry for data, so the split filter has no effect and all the renames fail.
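To see why items-array ends up as four numbers plus one nested array, the logic of the ruby filter can be run outside Logstash. This is a sketch that assumes a shortened version of the sample document. The relative order of the entries may differ from the output above, because to_i returns 0 for every non-numeric key.

```ruby
require 'json'

# Shortened version of the sample document from the first post.
message = '{"page":1,"per_page":3,"total":12,"total_pages":4,' \
          '"data":[{"id":1,"first_name":"George"},{"id":2,"first_name":"Janet"}]}'

string_indexed = JSON.parse(message)

# Same logic as the ruby filter: order the keys numerically, collect the values.
# Keys such as "page" or "data" are not numeric, so to_i gives 0 for all of them.
items_array = string_indexed.keys.sort_by(&:to_i).map { |key| string_indexed[key] }

# Four scalars plus ONE nested array: a split on items_array sees the whole
# "data" array as a single item rather than one item per user.
nested = items_array.select { |value| value.is_a?(Array) }
puts items_array.length
puts nested.length
```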


(Ritesh) #3

Hi @Badger

What I am looking for is a universal solution for all JSON data from the input. I do not want to use the split filter to go over multiple records.
The split filter is single threaded; when used with high-volume data, let's say over 100K, it creates clones and then splits.
I want to use ruby to identify all the keys/values.

On the above example:
I upgraded the filter to 3.1.7.
I removed all the @s from metadata, and I still get the same error:

Ruby exception occurred: undefined method `keys' for nil:NilClass

I also observed that I am not getting any target at all from the json filter.
Do you have a working example of a ruby filter that iterates over the keys, so that I can get a flat JSON and then split in one go?

I want to try something similar to what @yaauie suggested.


#4

But the transformation that Ry wrote is converting

{ 
"0": { "foo": "a" },
"1": { "bar": "b" }
}

into an array

[
{ "foo": "a" },
{ "bar": "b" }
]

You are starting off with an array. Converting it to a different array is not going to make any difference to how the split filter performs.
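A small sketch of that point, using the two shapes shown above. The variable names are only for illustration.

```ruby
# Input shape the quoted transformation was written for:
# a hash whose keys are stringified indexes.
string_indexed = { '0' => { 'foo' => 'a' }, '1' => { 'bar' => 'b' } }

# Order the keys numerically and collect the values into an array.
as_array = string_indexed.keys.sort_by(&:to_i).map { |key| string_indexed[key] }

# The "data" field in this thread is already an array, so there is
# nothing left for that transformation to convert.
data = [{ 'id' => 1 }, { 'id' => 2 }]
already_array = data.is_a?(Array)
```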


(Ritesh) #5

Hi @Badger, is there any other way wherein i can convert 1 event in message of json data into multiple events without using the split filter? ( especially with ruby filter)
Please let me know

Thanks


#6

You said elsewhere that you have 80,000 bytes of JSON. Unless the objects are tiny, I expect that is around 1,000 entries in the array. You also say it takes 4 hours, which is over 10,000 seconds. So over 10 seconds per event. Looking at the loop in the split filter, it is hard to imagine how that could be taking 10 seconds.
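The back-of-the-envelope arithmetic, written out. The entry count is only the estimate made above, not a measured value.

```ruby
# Figures quoted in this post; estimated_entries is a guess, not a measurement.
elapsed_seconds   = 4 * 60 * 60 # 4 hours
estimated_entries = 1_000

seconds_per_entry = elapsed_seconds.to_f / estimated_entries
puts seconds_per_entry
```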

So, you can try running with "--log.level debug", which will cause it to log a line for each entry in the array

[DEBUG][logstash.filters.split   ] Split event {:value=>{"last_name"=>"Bluth", "avatar"=>"https://s3.amazonaws.com/uifaces/faces/twitter/calebogden/128.jpg", "id"=>1, "first_name"=>"George"}, :field=>"[metadata][items-array][0]"}

Then review the timestamps on those events. In particular, how much of the time is spent in the input and json parsing, and are the events spit out by split regularly spaced, or are there long gaps in the output.


(Ritesh) #7

In this thread I am experimenting with a small toy dataset, which is this one (https://reqres.in/api/users).

Actually I have a bigger dataset with almost 15 columns and 80,000 records. From that forum discussion thread it is clear that, because of the split filter, it goes into serial mode and takes 4 hours.
To avoid the split filter, I started using ruby to see if I could achieve some result.

My ultimate aim is to get results in the output immediately, at least for those events that have finished processing in the filter, rather than having them wait in the queue for the split filter to complete first and then do the other operations in serial mode. That is not acceptable; Logstash is a streaming tool, so it should work somehow.


#9

OK, so if you have a piece of JSON that contains 80,000 records, each of which has 15 columns, then it might be 10 MB. And you are going to create 80,000 copies of that, which involves allocating 800 GB of memory (or possibly 1.6 TB if you have a copy of the message in addition to the parsed data). Oh, and this is Java, so I guess every char is two bytes, so perhaps 3.2 TB of memory to be allocated and GC'd. I'm not surprised it takes a long time.
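The same estimate as a few lines of arithmetic. All inputs are the rough guesses from this post (decimal units), not measurements.

```ruby
records    = 80_000
payload_mb = 10 # guessed size of the whole JSON document

# One full copy of the document per record.
clone_total_gb = records * payload_mb / 1_000           # 800 GB

# Doubled if each event also carries the raw message.
with_message_tb = clone_total_gb * 2 / 1_000.0          # 1.6 TB

# Doubled again if every char takes two bytes.
two_byte_chars_tb = with_message_tb * 2                 # 3.2 TB
```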

Replacing the json codec with

json {
    source => "message"
    target => "data"
    remove_field => [ "message" ]
}

may avoid a second copy of the 10 MB on each event (if it is present -- I'm not sure what you get from an http_poller with a codec).

Inside the split filter, it is cloning the event here, which creates a new 10 or 20 MB object, and the next line replaces that with something that only takes a couple of hundred bytes. That's a really expensive way of doing it. We need something more like the UNIX system call vfork, if you are familiar with that.

Instead of cloning the event, just set event_split to an empty new event and copy over the fields you need like timestamp, host, version, etc. (still doing event_split.set(@target, value) etc.). Then yield that.

However, the details of replacing the clone with an empty new event are beyond me. You need someone who understands a little more about events than I do.
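For what it is worth, here is a rough sketch of that idea as a ruby filter script file rather than a patch to the split filter. It is untested inside Logstash and rests on several assumptions: that LogStash::Event.new with no arguments gives an empty event, that copying the fields listed in the hypothetical copy_fields parameter is enough (tags, metadata and anything else would be lost), and that the default list below suits your data. The stand-in class at the top exists only so the sketch can run outside Logstash.

```ruby
# Stand-in so the sketch also runs outside Logstash. Inside Logstash the real
# LogStash::Event class is already defined and this block is skipped.
unless defined?(LogStash::Event)
  module LogStash
    class Event
      def initialize(fields = {})
        @fields = fields.dup
      end

      def get(name)
        @fields[name]
      end

      def set(name, value)
        @fields[name] = value
      end
    end
  end
end

def register(params)
  @field = params['field']
  @target = params['target']
  # Hypothetical parameter: fields to carry over from the original event.
  @copy_fields = params['copy_fields'] || ['@timestamp', 'host', '@version']
end

def filter(event)
  data = event.get(@field)
  return [event] unless data.is_a?(Array) # nothing to split, pass through

  data.map do |item|
    e = LogStash::Event.new               # empty event instead of event.clone
    @copy_fields.each do |name|
      value = event.get(name)
      e.set(name, value) unless value.nil?
    end
    e.set(@target, item)
    e
  end
end
```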


#10

The following may work. In a file called splitData.rb put

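# Parameters come from script_params in the ruby filter configuration.
def register(params)
    @field = params['field']
    @target = params['target']
end

# Emit one event per element of the array stored in @field.
def filter(event)
    data = event.get(@field)
    # Pass the event through untouched if there is nothing to split.
    return [event] unless data.is_a?(Array)
    # Remove the large array before cloning so that each clone stays small.
    event.remove(@field)
    data.map { |x|
        e = event.clone
        e.set(@target, x)
        e
    }
end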

Then call it using

json {
    source => "message"
    remove_field => [ "message" ]
}
ruby {
    path => '/home/user/splitData.rb'
    script_params => { field => "data" target => "data" }
}

The critical point is to remove the data field before calling event.clone.
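A rough illustration of why the order matters, run outside Logstash. A Hash stands in for the event, and a Marshal round trip stands in for the deep copy that event.clone is assumed to perform. The sizes are serialized sizes, used only as a proxy for how much each copy has to carry.

```ruby
# Stand-in for the deep copy that event.clone is assumed to perform.
def deep_copy(obj)
  Marshal.load(Marshal.dump(obj))
end

# A Hash stands in for the event; "data" is the large array to be split.
event = {
  'host' => 'example',
  'data' => Array.new(1_000) { |i| { 'id' => i, 'first_name' => 'George' } }
}

# Clone first, overwrite later: every copy carries the full array.
size_with_data = Marshal.dump(deep_copy(event)).bytesize

# Remove first, clone later: each copy only carries the small remainder.
data = event.delete('data')
size_without_data = Marshal.dump(deep_copy(event)).bytesize

# One small copy per array element, each holding a single item in "data".
copies = data.map do |item|
  e = deep_copy(event)
  e['data'] = item
  e
end
```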

It occurred to me that the split filter ought to be able to do this optimization (remove source before cloning if it is going to be overwritten). Looking at the code it appears that this line may be trying to do this. However I don't know what target refers to (not @target, which is never nil) so I am not sure what it does


(Ritesh) #11

Hey, thanks Badger, your solution looks much more promising now. Let me try this out and I will let you know the result.