I need to read approximately 1 million XML files, extract two specific fields (title and abstract), and load them into a named Elasticsearch index with corresponding fields.
I looked at the file input plugin, but the documentation suggests it is not designed for reading a whole file from the beginning.
Can anyone direct me to a good example of something like this to get me started?
I'd write a script to read the XML files and somehow get them to Logstash, perhaps by serializing each XML file to a single line and writing to a much smaller number of files that Logstash's file input will have no problems reading. Directing Logstash's file input directly at the million XML files with a giant wildcard will not end well.
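Something along these lines would do the flattening. It's only a minimal sketch, assuming the XML files sit under one directory and that collapsing newlines is enough to put each document on a single line; class name and paths are placeholders:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.stream.Stream;

public class XmlFlattener {
    public static void main(String[] args) throws IOException {
        Path sourceDir = Paths.get(args[0]);  // directory holding the XML files
        Path outFile = Paths.get(args[1]);    // combined one-document-per-line output file

        try (Stream<Path> files = Files.walk(sourceDir);
             BufferedWriter writer = Files.newBufferedWriter(outFile, StandardCharsets.UTF_8,
                     StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            files.filter(p -> p.toString().endsWith(".xml"))
                 .forEach(p -> {
                     try {
                         // Collapse the whole document onto one line so Logstash's
                         // file input treats each XML file as a single event.
                         String oneLine = Files.readString(p, StandardCharsets.UTF_8)
                                               .replaceAll("\\r?\\n", " ");
                         writer.write(oneLine);
                         writer.newLine();
                     } catch (IOException e) {
                         System.err.println("Skipping " + p + ": " + e.getMessage());
                     }
                 });
        }
    }
}
```

Running it once per batch of source directories (or adding simple output rotation) would give you the small number of combined files for the file input to pick up.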
Ha ha - thanks Magnus. When you say 'somehow get them to Logstash', that's the bit I'm unsure of.
Each XML file will become a document in Elasticsearch with fields for the title, abstract, and full text (i.e. the whole file). I have a Java program that performs the entire job, but I was keen to take advantage of Logstash's stop/restart control if something goes wrong along the way.
Which option do you think is best?
1. Stick with my own Java program, as Logstash is not suited to this kind of problem
2. Use an existing Logstash plugin (please let me know which one)
3. Write my own Logstash plugin - is this straightforward?
I think a fourth option, the one I described earlier, is best:
Use a program to read the XML files and append them in serialized form (i.e. one XML document per line) to a small number of files that Logstash's file input can read and process.
I'd keep it in the hundreds or low thousands. If you have multiple file inputs in your config they'll run in parallel, which will increase performance, so one idea could be to have one file per file input.
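A pipeline along these lines is roughly what that would look like; the paths, index name, and XPath expressions below are placeholders to adapt to the real XML structure:

```
input {
  # One file input per combined file; separate inputs run in parallel.
  file {
    path => "/data/flattened/batch-1.lines"
    start_position => "beginning"
    sincedb_path => "/data/sincedb/batch-1"
  }
  file {
    path => "/data/flattened/batch-2.lines"
    start_position => "beginning"
    sincedb_path => "/data/sincedb/batch-2"
  }
}

filter {
  xml {
    source => "message"    # the whole one-line XML document
    store_xml => false
    # XPath expressions are guesses; adjust them to the real document structure.
    # Note that the xml filter stores XPath matches as arrays.
    xpath => {
      "/article/title/text()"    => "title"
      "/article/abstract/text()" => "abstract"
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "articles"
  }
}
```

The original line stays in the message field, which would cover the full-text requirement mentioned above.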