Read and filter 1 million XML files

I have never used Logstash, but want to.

I need to read approximately 1 million XML files, extract 2 specific fields (title and abstract) from each, and load them into a named Elasticsearch index with corresponding fields.

I looked at the File input plugin, but its documentation suggests that it is not designed for reading in a whole file from the beginning.

Can anyone direct me to a good example of something like this to get me started?

thanks

I'd write a script to read the XML files and somehow get them to Logstash, perhaps by serializing each XML file to a single line and writing to a much smaller number of files that Logstash's file input will have no problems reading. Directing Logstash's file input directly at the million XML files with a giant wildcard will not end well.

Ha ha - thanks Magnus. When you say 'somehow get them to Logstash', that is the bit I am unsure of :)

Each XML file will become a document in Elasticsearch with fields for the abstract, the title, and the full text (i.e. the whole file). I have a Java program to perform the entire job, but I was keen to take advantage of Logstash's stop/restart control if something goes wrong along the way.

Which option do you think is best:

  1. Use my own program, as Logstash is not suited to this kind of problem
  2. Use an existing plugin with Logstash (please let me know which one)
  3. Write my own plugin for Logstash - is this straightforward?

Thank you.

Anton

I think option four that I described earlier is best:

  4. Use a program to read the XML files and append them in serialized form (i.e. one XML file per line) to a small number of files that Logstash's file input can read and process

Ok thanks Magnus. What do you call a small number of files? What is reasonable?
Thanks

I'd keep it in the hundreds or low thousands. If you have multiple file inputs in your config they'll run in parallel which will increase performance, so one idea could be to have one file per file input.
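
For anyone wanting a starting point, here is a minimal sketch of the "serialize each XML file to one line" step in Python. It is not something posted in this thread: the function names (`serialize_xml`, `bundle`), the `bundle-NNNN.txt` naming, and the default of 100 output files are all made up for illustration. It also assumes the files are UTF-8 and that collapsing runs of whitespace into a single space is acceptable for your content (it would alter whitespace-sensitive text such as preformatted sections).

```python
import re
from pathlib import Path


def serialize_xml(text: str) -> str:
    """Collapse an XML document onto a single line.

    Assumption: whitespace inside the document is not significant,
    so every run of whitespace (including newlines) becomes one space.
    """
    return re.sub(r"\s+", " ", text).strip()


def bundle(src_dir: str, out_dir: str, n_bundles: int = 100) -> int:
    """Append every *.xml file under src_dir, one per line, to n_bundles files.

    Files are distributed round-robin so the bundles end up roughly equal
    in size. Returns the number of XML files processed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme: bundle-0000.txt, bundle-0001.txt, ...
    # All bundles stay open at once, so keep n_bundles below the
    # operating system's open-file limit.
    handles = [
        (out / f"bundle-{i:04d}.txt").open("w", encoding="utf-8")
        for i in range(n_bundles)
    ]
    count = 0
    try:
        for count, xml_path in enumerate(Path(src_dir).rglob("*.xml"), start=1):
            line = serialize_xml(xml_path.read_text(encoding="utf-8"))
            handles[(count - 1) % n_bundles].write(line + "\n")
    finally:
        for handle in handles:
            handle.close()
    return count
```

Calling something like `bundle("/data/xml", "/data/bundles", n_bundles=200)` (paths are placeholders) would leave 200 line-oriented files, each of which could then be given to its own file input as suggested above. Extracting the title and abstract from each line would still have to happen on the Logstash side.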