How to upsert nested objects using grok and Logstash

I'm currently analysing data from my MySQL subtitles database and indexing it into Elasticsearch 5.2. My Logstash pipeline has the following filter:

filter {
    grok {
        match => ["subtitles", "%{TIME:[_subtitles][start]} --> %{TIME:[_subtitles][end]}%{GREEDYDATA:[_subtitles][sentence]}" ]
    }
}

which produces the following:

"_subtitles": {
                  "sentence": [
                     "im drinking latte",
                     "im drinking coffee",
                     "while eating a missisipi cake"
                  ],
                  "start": [ ... ],
                  "end": [ ... ]
}

But what I want is this:

 "_subtitles": [
                     {
                          "sentence": "im drinking latte",
                          "start": "00:00:00.934",
                          "end": "00:00:02.902"
                     },
                     {... same structure as above},
                     {... same structure as above}
 ]

Keep in mind that _subtitles can be mapped as a nested type through a predefined mapping if needed.

The original data looks like this:

00:00:00.934 --> 00:00:02.902
im drinking latte

00:00:01.934 --> 00:00:03.902
im drinking coffee

00:00:04.902 --> 00:00:05.839
while eating a missisipi cake

How can I achieve that (an array of nested objects) using grok's match pattern and placeholders?

I'm pretty sure you can't. You'll have to postprocess the field with a ruby filter.
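
To illustrate what that post-processing amounts to, here is a minimal plain-Ruby sketch that turns three parallel arrays (what grok produces) into one hash per subtitle. The helper name `zip_subtitles` is made up for illustration, and the sketch assumes the three arrays always have the same length and order.

```ruby
# Hypothetical helper: combine grok's parallel arrays into one hash per cue.
# Assumes starts, ends and sentences line up index by index.
def zip_subtitles(starts, ends, sentences)
  starts.zip(ends, sentences).map do |start_time, end_time, sentence|
    { 'start' => start_time, 'end' => end_time, 'sentence' => sentence.to_s.strip }
  end
end

subs = zip_subtitles(
  ['00:00:00.934', '00:00:01.934'],
  ['00:00:02.902', '00:00:03.902'],
  ['im drinking latte', 'im drinking coffee']
)
puts subs.inspect
```

Inside a Logstash ruby filter, the same logic would read the arrays with event.get and write the result back with event.set.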

Yeah, I actually ended up resolving the issue that way.
But now the problem is that it's over-processing; anyway, I can probably improve on the version below.

So after a lot of research and reading, I found the answer.

The best way to do it is either:

  • Leave Logstash and write my own script for migrating from MySQL to Elasticsearch, but then I'd have to do all the pattern recognition and replacement myself, which can get somewhat complicated.
  • Post-process the fields with a Ruby script/filter.

The solution was as follows:

ruby {
      code => "
        # Combine the parallel arrays into one hash per subtitle
        subtitles = []
        starts = event.get('start')
        ends = event.get('end')
        sentences = event.get('sentence')
        counter = 0
        starts.each do |v|
         temp_hash = {}
         temp_hash['index'] = counter
         temp_hash['start'] = v
         temp_hash['end'] = ends[counter]
         temp_hash['sentence'] = sentences[counter]
         counter += 1
         subtitles.push(temp_hash)
        end
        event.set('subtitles', subtitles)
      "
}
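
As an alternative to matching with grok first and zipping afterwards, the raw subtitle text could be parsed into nested objects in a single pass. The following is only a sketch under assumptions: the helper name `parse_cues` is hypothetical, and it assumes every cue is a timestamp line followed by exactly one line of text, as in the sample data above.

```ruby
# Hypothetical helper: parse raw subtitle text straight into an array of hashes.
# Assumes each cue is 'HH:MM:SS.mmm --> HH:MM:SS.mmm' followed by one text line.
def parse_cues(raw)
  cue = /(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})[ \t]*\r?\n([^\r\n]*)/
  raw.scan(cue).each_with_index.map do |(start_time, end_time, sentence), index|
    { 'index' => index, 'start' => start_time, 'end' => end_time, 'sentence' => sentence.strip }
  end
end

raw = "00:00:00.934 --> 00:00:02.902\nim drinking latte\n\n" \
      "00:00:04.902 --> 00:00:05.839\nwhile eating a missisipi cake\n"
parse_cues(raw).each { |cue| puts cue.inspect }
```

Multi-line sentences would need a different pattern, so this is not a drop-in replacement for a real subtitle parser.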

Hope that helps.
But now I'm trying to improve this, because my Elasticsearch container fails with something like "cannot handle requests" or goes offline for a while, just from the indexing (currently around 20k rows from MySQL, with around 40 nested objects each).
Is there anything I can do to make it faster?
Maybe a way to flag docs that were already processed the previous day, so I don't process them again?
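
One way to picture the "only process what changed" idea is the plain-Ruby sketch below. It keeps only rows modified since the last run and splits them into smaller batches so Elasticsearch is not hit with everything at once. The `updated_at` column, the helper name `pending_batches`, and the batch size are all assumptions for illustration; nothing in my schema is guaranteed to look like this. As far as I know, the Logstash jdbc input can do similar tracking through its `sql_last_value` parameter, which is worth checking in the plugin documentation.

```ruby
require 'time'

# Hypothetical helper: select rows changed since last_run and group them
# into batches. Assumes each row has an ISO 8601 'updated_at' string.
def pending_batches(rows, last_run, batch_size = 500)
  pending = rows.select { |row| Time.iso8601(row['updated_at']) > last_run }
  pending.each_slice(batch_size).to_a
end

rows = [
  { 'id' => 1, 'updated_at' => '2017-03-01T10:00:00Z' },
  { 'id' => 2, 'updated_at' => '2017-03-02T10:00:00Z' },
  { 'id' => 3, 'updated_at' => '2017-03-03T10:00:00Z' }
]
last_run = Time.iso8601('2017-03-01T12:00:00Z')
pending_batches(rows, last_run, 1).each { |batch| puts batch.inspect }
```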

