Logstash vs Spark vs something else


#1

I have been using Logstash to connect to Postgres, parse XML, extract certain XPaths, and push the results to Elasticsearch. However, it's become fairly apparent that Logstash isn't meant for parsing entire files.
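For context, a setup like this typically combines a jdbc input with the xml filter. A minimal sketch (the connection string, table, column, and XPath are all hypothetical placeholders):

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"  # hypothetical DSN
    jdbc_user => "logstash"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT id, xml_doc FROM documents"                   # hypothetical table/column
  }
}

filter {
  xml {
    source    => "xml_doc"
    store_xml => false
    # hypothetical XPath; each match is stored in the named field
    xpath     => { "/order/customer/name/text()" => "customer_name" }
  }
}

output {
  elasticsearch { index => "documents" }
}
```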

Is there a better tool for this job? I am new to Spark, and I suspect it may be overkill, since everything lives in a single database rather than in Hadoop.

If Logstash is not the right tool for this, are there any recommendations you have for one?

Thank you


(Paris Mermigkas) #2

What are the limitations you are experiencing that make it apparent? Throughput or something else? Have you identified any bottlenecks?

In my opinion, using a distributed framework just to traverse an XML structure ingested from somewhere might indeed be overkill.


#3

One limitation is the multiline input for Logstash. I don't know of any way to configure it so that an entire file becomes a single event. Logstash is meant for logs, and even the multiline codec still has to be configured with a maximum number of lines or some other cap.


(Paris Mermigkas) #4

By "entire file", do you mean an entire XML structure spanning multiple lines? If so, it's doable by configuring your multiline pattern to match the top-level tag of your XML, as seen in this example.
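A sketch of what that could look like, assuming the documents are read from files and `<order>` is the (hypothetical) top-level tag:

```
input {
  file {
    path => "/data/xml/*.xml"    # hypothetical path
    codec => multiline {
      pattern => "^<order>"      # a line starting the top-level tag begins a new event
      negate  => true
      what    => "previous"      # all other lines attach to the previous event
      auto_flush_interval => 2   # flush the final document after 2s of inactivity
    }
  }
}
```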


#5

But the multiline codec still has a default max_lines value of 500, and even if I set that to something much higher, there is still a maximum. I have run into cases where I don't know the total number of lines, so I set it to something very high, but once in a while a document still gets cut off.


(Paris Mermigkas) #6

Ah, right. Yeah, I guess that setting is there to prevent the codec from endlessly consuming lines (and blowing up memory) due to malformed boundaries or the like.

The reason the document gets cut off even with a very high max_lines value could be the max_bytes setting, as the two work in conjunction, and reaching either limit terminates the multiline event.

You can try setting that to a high value as well if you haven't done so, or (though it's a riskier approach) remove the checks altogether from the source code so you don't have to worry about it.
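For example, raising both limits in the codec (the exact values here are placeholders; tune them to comfortably exceed your largest documents):

```
codec => multiline {
  pattern   => "^<order>"    # hypothetical top-level tag
  negate    => true
  what      => "previous"
  max_lines => 1000000       # default is 500
  max_bytes => "100 MiB"     # default is "10 MiB"; reaching either limit cuts the event
}
```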

The thing is, by using other tools or homebrew code without those limitations, you potentially lose all the other niceties that Logstash provides by default, like handling back-pressure, stalled or dead ES nodes, the internal queue, etc.
That's why, in my opinion, it's better to work around those issues than to deal with the potential absence of such features in other tools.


#7

Have you tried an http input?


(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.