Hi!
I'm importing a large (20M rows) table into elasticsearch and sometimes these rows have a column notes
that is either empty or holding a JSON string.
As this is a legacy app, this notes
column may hold two different structures of JSON such as:
{"1":{"date_added":"2018-01-04 01:23:31","username":"user","content":"Note Content"}}
or
[{"date_added":"2019-10-08 13:30:08","username":"username","content":"Note Content"}]
There might be more than one note in this JSON array. The values of the keys might be numeric, but I need to make sure they're casted to string always, hence the .map
.
The following is the part of my ruby filter that's trying to handle both cases, however on the 1st example above, it saves the note twice and I don't understand why.
Any thoughts on how to fix this problem?
# Cast all notes values to string
notes = event.get('notes')
notes_json = notes.empty? ? JSON['{}'] : JSON.parse(notes)
notes_json.to_a.each_with_index do |(n_key, n_val), n_index|
test_key = Integer(n_key) rescue false
if test_key.is_a? Integer
string_notes = Hash[n_val.map{|k,v|[ k, v.to_s ]}]
else
string_notes = Hash[n_key.map{|k,v|[ k, v.to_s ]}]
end
notes_json[n_index] = string_notes
end
event.set('notes', notes_json)
In my index template notes
is marked as of nested type:
"notes": {
"type": "nested"
}
A note always has date_added
, username
, and content
. There can be zero (empty) or multiple notes in this column.
When the column value is empty I would actually want to save notes as empty, I'm not sure if that's possible?
If I run logstash, the final mapping of notes show up like this:
"notes": {
"properties": {
"0": {
"properties": {
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"date_added": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"username": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"1": {
"properties": {
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"date_added": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"username": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"date_added": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"username": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
So it's actually handled quite wrong... I need to have notes as a nested array of json objects, each having date_added, username and content fields. Not sure how to fix this. Thoughts?
Thanks for your time!