Ruby filter code for parsing JSON column sometimes insert values twice

Hi!

I'm importing a large (20M rows) table into elasticsearch and sometimes these rows have a column notes that is either empty or holding a JSON string.

As this is a legacy app, this notes column may hold two different structures of JSON such as:

{"1":{"date_added":"2018-01-04 01:23:31","username":"user","content":"Note Content"}}

or

[{"date_added":"2019-10-08 13:30:08","username":"username","content":"Note Content"}]

There might be more than one note in this JSON array. The values of the keys might be numeric, but I need to make sure they're casted to string always, hence the .map.

The following is the part of my ruby filter that's trying to handle both cases, however on the 1st example above, it saves the note twice and I don't understand why.

Any thoughts on how to fix this problem?

# Cast all notes values to string

notes = event.get('notes')
notes_json = notes.empty? ? JSON['{}'] : JSON.parse(notes)

notes_json.to_a.each_with_index do |(n_key, n_val), n_index|

   test_key = Integer(n_key) rescue false

   if test_key.is_a? Integer
      string_notes = Hash[n_val.map{|k,v|[ k, v.to_s ]}]
   else
      string_notes = Hash[n_key.map{|k,v|[ k, v.to_s ]}]
   end

   notes_json[n_index] = string_notes
end

event.set('notes', notes_json)

In my index template notes is marked as of nested type:

"notes": {
    "type": "nested"
}

A note always has date_added, username, and content. There can be zero (empty) or multiple notes in this column.

When the column value is empty I would actually want to save notes as empty, I'm not sure if that's possible?

If I run logstash, the final mapping of notes show up like this:

"notes": {
  "properties": {
    "0": {
      "properties": {
        "content": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "date_added": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "username": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    },
    "1": {
      "properties": {
        "content": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "date_added": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "username": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    },
    "content": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "date_added": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "username": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
},

So it's actually handled quite wrong... I need to have notes as a nested array of json objects, each having date_added, username and content fields. Not sure how to fix this. Thoughts?

Thanks for your time!

Before this line executes, notes_json is a hash that contains

{"1"=>{"date_added"=>"2018-01-04 01:23:31", "username"=>"user", "content"=>"Note Content"}}

Afterwards, the hash gets a second entry

0=>{"date_added"=>"2018-01-04 01:23:31", "username"=>"user", "content"=>"Note Content"}

event.set converts that to

"0"=>{"date_added"=>"2018-01-04 01:23:31", "username"=>"user", "content"=>"Note Content"}

So you end up with a hash that contains two hashes.

1 Like

Oh I see... Not really a ruby developer I am, I didn't see it coming... It does make sense now.

I was really just trying to get rid of the keys like "1", "2" and cast the values for date_added, username and content to string before updating notes with event.set call. I want do avoid having the 0, 1 each having date_added, username and content underneath them in the final mapping.

Basically, in the final mapping I need to somehow nest multiple entries each having those three properties, but not be under zero, one etc as keys.

Could you please help me in achieving this?

I would consider handling the two cases separately

    ruby {
        code => '
            notes = event.get("notes")
            notes_json = notes.empty? ? JSON["{}"] : JSON.parse(notes)
            if notes_json.is_a? Hash
                a = []
                notes_json.each { |k, v| a << v.to_s }
                event.set("notes", a)
            end
            if notes_json.is_a? Array
                a = []
                notes_json.each { |v| a << v.to_s }
                event.set("notes", a)
            end
        '
    }

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.