Hi!
I'm importing a large (20M rows) table into elasticsearch and sometimes these rows have a column notes that is either empty or holding a JSON string.
As this is a legacy app, this notes column may hold two different structures of JSON such as:
{"1":{"date_added":"2018-01-04 01:23:31","username":"user","content":"Note Content"}}
or
[{"date_added":"2019-10-08 13:30:08","username":"username","content":"Note Content"}]
There might be more than one note in this JSON array. The values of the keys might be numeric, but I need to make sure they're casted to string always, hence the .map.
The following is the part of my ruby filter that's trying to handle both cases, however on the 1st example above, it saves the note twice and I don't understand why.
Any thoughts on how to fix this problem?
# Cast all notes values to string
notes = event.get('notes')
notes_json = notes.empty? ? JSON['{}'] : JSON.parse(notes)
notes_json.to_a.each_with_index do |(n_key, n_val), n_index|
test_key = Integer(n_key) rescue false
if test_key.is_a? Integer
string_notes = Hash[n_val.map{|k,v|[ k, v.to_s ]}]
else
string_notes = Hash[n_key.map{|k,v|[ k, v.to_s ]}]
end
notes_json[n_index] = string_notes
end
event.set('notes', notes_json)
In my index template notes is marked as of nested type:
"notes": {
"type": "nested"
}
A note always has date_added, username, and content. There can be zero (empty) or multiple notes in this column.
When the column value is empty I would actually want to save notes as empty, I'm not sure if that's possible?
If I run logstash, the final mapping of notes show up like this:
"notes": {
"properties": {
"0": {
"properties": {
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"date_added": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"username": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"1": {
"properties": {
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"date_added": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"username": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"date_added": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"username": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
So it's actually handled quite wrong... I need to have notes as a nested array of json objects, each having date_added, username and content fields. Not sure how to fix this. Thoughts?
Thanks for your time!