I have been using Logstash to connect to Postgres to parse XML and extract certain xpaths and then push them to Elasticsearch. However, it's been fairly apparent that Logstash isn't meant for parsing entire files.
Is there a better tool for this job? I am new to Spark, and I don't know whether it's overkill here, since everything lives in a single database rather than in Hadoop.
If Logstash is not the right tool for this, are there any recommendations you have for one?
One limitation is the multiline input for Logstash. I don't know of any way to configure it to have an entire file be an event. Logstash is meant for logs, and even the multiline input still needs to be configured for a set number of lines or a max of something.
By "entire file", do you mean an entire XML structure spanning multiple lines? If so, it's doable by configuring your multiline marker to be the top-level tag of your XML, as seen in this example.
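To make the idea concrete, here is a rough Python sketch of what "use the top-level tag as the multiline marker" amounts to: every line that matches the start pattern opens a new event, and every other line is appended to the current one. This is only a toy model, not Logstash code; the function name `group_by_start_tag` and the `<record>` tag are made up for illustration.

```python
import re

def group_by_start_tag(lines, start_pattern):
    """Group lines into events; a line matching start_pattern opens a new event.

    Hypothetical helper for illustration only, not part of Logstash.
    """
    start = re.compile(start_pattern)
    events, current = [], []
    for line in lines:
        # A matching line closes the event collected so far (if any).
        if start.search(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    # Flush whatever is left once the input ends.
    if current:
        events.append("\n".join(current))
    return events

lines = [
    "<record>",
    "  <id>1</id>",
    "</record>",
    "<record>",
    "  <id>2</id>",
    "</record>",
]
events = group_by_start_tag(lines, r"^<record>")
for event in events:
    print(event)
    print("---")
```

Each `<record>...</record>` block ends up as its own event, however many lines it spans, as long as the opening tag reliably starts a line.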
But the multiline codec still has a default max_lines value of 500, and even if I set that to something much higher, there is still a max value. I have run into a case where I don't know the total lines, so I just try to set it to something super high, but once in a while it still cuts off my document.
Ah, right. Yeah, I guess that setting is there to keep memory from blowing up if the codec ends up consuming lines endlessly because of malformed boundaries or whatever.
The reason the document gets cut off even with a very high max_lines value could be the max_bytes setting: the two work in conjunction, and reaching either limit terminates the multiline event.
You can try setting max_bytes to a high value as well, if you haven't done so already, or (though this is a potentially dangerous approach) remove the checks from the source code altogether so you don't have to worry about them.
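As a rough illustration of why raising max_lines alone may not help, here is a toy Python model of two limits working together. It is not Logstash's actual implementation: the function name `first_limit_hit` is made up, newline bytes are ignored, and the 10 MiB default for max_bytes is an assumption on my part (500 for max_lines is the default mentioned above).

```python
def first_limit_hit(lines, max_lines=500, max_bytes=10 * 1024 * 1024):
    """Return (lines_kept, reason); reason is "max_lines", "max_bytes" or None.

    Toy model only: accumulation stops as soon as either limit would be exceeded.
    """
    kept = 0
    size = 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        if kept + 1 > max_lines:
            return kept, "max_lines"
        if size + line_size > max_bytes:
            return kept, "max_bytes"
        kept += 1
        size += line_size
    return kept, None

# A made-up document: 20,000 lines of 1,000 bytes each (about 20 MB).
doc = ["x" * 1000] * 20000

print(first_limit_hit(doc))                        # stops on the line limit
print(first_limit_hit(doc, max_lines=1_000_000))   # now the byte limit bites
print(first_limit_hit(doc, max_lines=1_000_000,
                      max_bytes=100 * 1024 * 1024))  # both raised: nothing cut
```

In this model, lifting only max_lines just moves the cutoff from one limit to the other, which would match a document that is still truncated "once in a while" when it happens to be large in bytes.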
Thing is, by switching to other tools or homebrew code that doesn't have those limitations, you potentially lose the other niceties Logstash gives you by default, like handling back-pressure, stalled or dead ES nodes, the internal queue, etc.
That's why, in my opinion, it's better to try to work around those issues than to deal with the potential absence of such features in other tools.