Hi all,
I'm currently working on parsing a really nasty XML stream: events come in single-file form, can reach up to 5000 lines and, most importantly, include arrays of arrays which hold exactly the data I need to parse and visualize.
After some tinkering I came to the solution of using multiple split filters: one for the main array and others for the sub-arrays. This solution, however, leaves me out in the cold if a sub-array includes fields I have not accounted for (a case I fear will come up sooner or later).
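For reference, roughly what I have now; "records" and "values" are placeholders for my real field names:

filter {
  xml {
    source => "message"
    target => "parsed"
  }
  # one event per element of the main array
  split {
    field => "[parsed][records]"
  }
  # one event per element of a known sub-array;
  # a sub-array the config doesn't name simply never gets split
  split {
    field => "[parsed][records][values]"
  }
}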
Is there any way to make the split filter recursive so that it can manage any depth of sub-arrays?
I understand this would be very load intensive, but given the scope of the project, that is already expected.
Thanks for the replies,
I suppose that the "correct" way to do this is through xpath (which, to be honest, I don't want to dive into unless someone can guarantee beforehand that it can handle nested arrays), so I think for the moment I will stick to the multiple splits. It may not be elegant, but it works.
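If I ever do go the xpath route, from the xml filter docs I gather it would look something like this (the path here is made up for illustration):

xml {
  source => "message"
  store_xml => false
  # each xpath expression maps to a destination field;
  # results come back as arrays of strings
  xpath => {
    "/root/record/values/value/text()" => "value_text"
  }
}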
After some testing it seems I found the error: the XML has an array node with multiple attributes on that very node and on its children, so with the default xml filter configuration Logstash was mixing up the values, giving me a lot of headaches.
The situation improved a lot after changing these xml filter parameters:
force_array => false
force_content => true
All while keeping the split for the array node.
Now I only get a cosmetic split failure when the array node has a single value (it becomes a hash, so it's not splittable), but I can live with that by removing the tag. The final filter block is sketched below.
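Putting it together, the filter block now looks roughly like this (field paths are placeholders for my real ones, and if I'm reading the split filter right, the tag it adds on a non-splittable field is _split_type_failure):

filter {
  xml {
    source => "message"
    target => "parsed"
    force_array => false
    force_content => true
  }
  split {
    field => "[parsed][records][record]"
  }
  # with force_array => false a single-element array collapses to a hash,
  # so the split tags the event instead of splitting it; drop that tag
  # (removing a tag that isn't present is a harmless no-op)
  mutate {
    remove_tag => ["_split_type_failure"]
  }
}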