The array looks like this:
data.children.data.replies.data.children.data.replies.data.children.data.replies.data.children.data.replies.data.children.....ETC
I have this JSON array that nests, and I would like to get all the data within each
[children][data]
So if [replies] is present in there, there will be another set of [children][data], and so on and so on.
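To illustrate (a trimmed-down sketch, not the real data), each level looks roughly like this:

{
    "data" => {
        "children" => [
            {
                "data" => {
                    "body"    => "a comment",
                    "replies" => {
                        "data" => {
                            "children" => [
                                { "data" => { "body" => "a reply", "replies" => "" } }
                            ]
                        }
                    }
                }
            }
        ]
    }
}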
I found this, but I have no idea how to get it working for my data set since I am a complete noob when it comes to Ruby.
Flattening an Array with Recursion
def recursive_flatten(array, results = [])
  array.each do |element|
    if element.class == Array
      recursive_flatten(element, results)
    else
      results << element
    end
  end
  results
end
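If I understand it right, this only flattens nested arrays, e.g. something like

recursive_flatten([1, [2, [3, 4]], 5])
# => [1, 2, 3, 4, 5]

whereas my data is nested hashes, so I am not sure it applies directly.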
Thx, but the code you have there doesn't seem to use recursion. My JSON was just an example; there could be 5, or 10 or more, levels of nesting in different examples.
I took the JSON you linked to in your first post, put it on a single line, and read it using a file input. What I get back from my code is
"replies" => [
[0] {
[...]
"permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekizeo5/",
[...]
"body_html" => "<div class=\"md\"><p>reply to 2nd comment</p>\n</div>",
[...]
},
[1] {
[...]
"permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekize5c/",
[...]
"body" => ">here is a reply to the reply of the 1st comment\n\nanother here is a reply to the reply of the 1st comment",
[...]
},
[2] {
[...]
"permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekizdfx/",
[...]
"body" => "another level deeper on the 1st comment",
[...]
},
[3] {
[...]
"permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekizc4b/",
[...]
"body" => "here is a reply to the reply of the 1st comment",
[...]
},
[4] {
[...]
"permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekizbej/",
[...]
"body" => "here is a reply to the 1st comment",
[...]
}
],
[...]
Get rid of the json codec on the input. The json filter expects the JSON to be serialized in [message].
Alternatively, if you have other reasons for using the codec, get rid of the json filter and change the ruby code. The codec appears to split the array into events, so this piece of code will need to change.
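That is, something along these lines (just a sketch, assuming the whole serialized document ends up in [message]):

filter {
    # parse the serialized document out of [message]
    json {
        source => "message"
    }
}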
Thx for the help, I really appreciate the time you are spending.
I did remove the json codec, but now I am receiving an error when it tries to index the data. I checked stdout and it also shows a _rubyexception.
[2019-04-10T17:30:01,523][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `each' for nil:NilClass
[2019-04-10T17:30:01,529][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `each' for nil:NilClass
[2019-04-10T17:30:01,677][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2ndtest-reply-2019.04.10", :_type=>"_doc", :routing=>nil}, #<LogStash::Event:0x55abba13>], :response=>{"index"=>{"_index"=>"logstash-2ndtest-reply-2019.04.10", "_type"=>"_doc", "_id"=>"lcmWCWoBpdFPTkX2Ohre", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field [data.children.data.edited] of type [boolean] in document with id 'lcmWCWoBpdFPTkX2Ohre'", "caused_by"=>{"type"=>"json_parse_exception", "reason"=>"Current token (VALUE_NUMBER_FLOAT) not of boolean type\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@655dde02; line: 1, column: 175676]"}}}}}
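As for the [edited] boolean mapping error: reddit sets that field to either false or an edit timestamp (a float), which is what elasticsearch is complaining about. One workaround might be to force it to a string in a ruby filter. A rough, untested sketch that only handles the top-level children:

ruby {
    code => '
        # sketch: [edited] is either false or an epoch float in the reddit
        # data; force it to a string so the mapping stays consistent
        kids = event.get("[data][children]")
        if kids.kind_of?(Array)
            kids.each_index { |x|
                d = kids[x]["data"]
                next unless d && !d["edited"].nil?
                event.set("[data][children][#{x}][data][edited]", d["edited"].to_s)
            }
        end
    '
}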
ruby {
    code => '
        a = []

        # Recursively walk nested [replies][data][children], collecting the
        # data hash of every reply into the array a.
        def processReplies(p, a)
            if p["replies"] and
               p["replies"]["data"] and
               p["replies"]["data"]["children"] and
               p["replies"]["data"]["children"].kind_of?(Array)
                p["replies"]["data"]["children"].each { |v1|
                    processReplies(v1["data"], a)
                    # Cannot remove it until after iterating over it
                    v1["data"].delete("replies")
                    a << v1["data"]
                }
            end
        end

        theData = event.get("[data][children]")
        if theData.kind_of?(Array)
            theData.each { |v1|
                processReplies(v1["data"], a)
            }
        end
        #event.remove("data")
        # Only add the flattened list if we actually collected anything
        if a != []
            event.set("replies", a)
        end
    '
}
For the sample data that gets you two events, and the second one has an array of 5 replies.
It's got to be an issue with the http_poller input and how it is bringing in the JSON. It still is not flattening the message with the comments. It's only bringing in that 1st object, which is just the post title and other values ("Test JSON output of comments").
input {
    http_poller {
        urls => {
            test1 => "https://www.reddit.com/r/softwaretesting/comments/bbhytv/.json"
        }
        request_timeout => 60
        schedule => { cron => "* * * * * UTC" }
        codec => "json"
    }
}

filter {
    ruby {
        code => '
            a = []
            def processReplies(p, a)
                if p["replies"] and
                   p["replies"]["data"] and
                   p["replies"]["data"]["children"] and
                   p["replies"]["data"]["children"].kind_of?(Array)
                    p["replies"]["data"]["children"].each { |v1|
                        processReplies(v1["data"], a)
                        # Cannot remove it until after iterating over it
                        v1["data"].delete("replies")
                        a << v1["data"]
                    }
                end
            end
            theData = event.get("[data][children]")
            if theData.kind_of?(Array)
                theData.each { |v1|
                    processReplies(v1["data"], a)
                }
            end
            #event.remove("data")
            if a != []
                event.set("replies", a)
            end
        '
    }
}

output {
    elasticsearch {
        hosts => ["http://server:9200"]
        index => "logstash-test-reply-%{+YYYY.MM.dd}"
    }
}
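When debugging I also temporarily add a rubydebug stdout output alongside elasticsearch, something like:

output {
    stdout { codec => rubydebug }
}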
[2019-04-11T12:26:00,500][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-tester-reply-2019.04.11", :_type=>"_doc", :routing=>nil}, #<LogStash::Event:0x2d2d0b3d>], :response=>{"index"=>{"_index"=>"logstash-tester-reply-2019.04.11", "_type"=>"_doc", "_id"=>"3d-mDWoBqKIyXB-qQN-S", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"Can't merge a non object mapping [data.children.data.replies] with an object mapping [data.children.data.replies]"}}}}
I think I see what the issue is. When the data is sent to Elastic, it sends both objects in the same event, and that is why I am only getting the 1st one, which does not have the comments, only the post details.
Then the 2nd object fails to index.
I really only need the 2nd object anyway. Is there a way to remove that 1st object with just the post details?
Oh. That's not good. In the second event there are multiple entries in the [data][children] array. In the first one (Couldn't you do that in your own subreddit...) [data][replies] is
"replies" => "",
That tells elasticsearch that it should expect [data.children.data.replies] to be a string. In the second one (Here is a 2nd comment) [data][replies] is an object.
A field in elasticsearch can be a string, or it can be an object, but it has to consistently be one or the other. So when it is a string, we need to change it to be an object:
theData = event.get("[data][children]")
if theData.kind_of?(Array)
    theData.each_index { |x|
        if theData[x]["data"]["replies"] == ""
            event.set("[data][children][#{x}][data][replies]", {})
        end
    }
    theData.each { |v1|
        processReplies(v1["data"], a)
    }
end
You could do that based on whether or not [data][children][0][data][replies] exists. The normal test for field existence does not work here; you have to do it in ruby.
if ! event.get("[data][children][0][data][replies]")
    event.cancel
end
Getting so close.
I used the solution to turn the empty strings into objects, and I got rid of that 1st object, but now we are running into issues with the other comments in the replies.
Can't merge a non object mapping [data.children.data.replies.data.children.data.replies] with an object mapping [data.children.data.replies.data.children.data.replies]"}}}}
Correct, I only need the replies as well as the data.children.data.
I just tried to do that, as well as make a few changes with mutate, and it didn't work. All it did was remove the replies and data; it didn't leave the root level fields like I defined in the mutate.
Unless you have added a split filter, both [replies] and [data][children] are arrays, so you would have to add [0] to everything to reference the first entry.
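For example, a split filter along these lines (a sketch, not tested against this data) would give you one event per reply, so the reply fields could be referenced without the [0]:

filter {
    # one event per entry of the [replies] array built by the ruby filter
    split {
        field => "replies"
    }
}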