I need to read approximately 1 million XML files, extract two specific fields (title and abstract), and load them into a named Elasticsearch index with corresponding fields.
I looked at the file input plugin, but the documentation suggests it is not designed for reading a whole file from the beginning.
Can anyone direct me to a good example of something like this to get me started?
I'd write a script to read the XML files and somehow get them to Logstash, perhaps by serializing each XML file to a single line and writing to a much smaller number of files that Logstash's file input will have no problems reading. Directing Logstash's file input directly at the million XML files with a giant wildcard will not end well.
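Something along these lines would do the flattening. It's only a minimal sketch, assuming the XML files sit under one directory and that collapsing newlines is enough to put each document on a single line; class name and paths are placeholders:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.stream.Stream;

public class XmlFlattener {
    public static void main(String[] args) throws IOException {
        Path sourceDir = Paths.get(args[0]);  // directory holding the XML files
        Path outFile = Paths.get(args[1]);    // combined one-document-per-line output file

        try (Stream<Path> files = Files.walk(sourceDir);
             BufferedWriter writer = Files.newBufferedWriter(outFile, StandardCharsets.UTF_8,
                     StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            files.filter(p -> p.toString().endsWith(".xml"))
                 .forEach(p -> {
                     try {
                         // Collapse the whole document onto one line so Logstash's
                         // file input treats each XML file as a single event.
                         String oneLine = Files.readString(p, StandardCharsets.UTF_8)
                                               .replaceAll("\\r?\\n", " ");
                         writer.write(oneLine);
                         writer.newLine();
                     } catch (IOException e) {
                         System.err.println("Skipping " + p + ": " + e.getMessage());
                     }
                 });
        }
    }
}
```

Running it once per batch of source directories (or adding simple output rotation) would give you the small number of combined files for the file input to pick up.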
Ha ha - thanks Magnus. When you say 'somehow get them to Logstash', that's the bit I'm unsure of.
Each XML file will become a document in Elasticsearch with fields for the title, abstract, and full text (i.e. the whole file). I have a Java program that performs the entire job, but I was keen to take advantage of Logstash's stop/restart control if something goes wrong along the way.
Which option do you think is best?
1. Stick with my own Java program, as Logstash is not suited to this kind of problem
2. Use an existing Logstash plugin (please let me know which one)
3. Write my own Logstash plugin - is this straightforward?
I think a fourth option, the one I described earlier, is best:
Use a program to read the XML files and append them in serialized form (i.e. one XML document per line) to a small number of files that Logstash's file input can read and process.
I'd keep it in the hundreds or low thousands. If you have multiple file inputs in your config they'll run in parallel, which will increase performance, so one idea could be to have one file per file input.
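A pipeline along these lines is roughly what that would look like; the paths, index name, and XPath expressions below are placeholders to adapt to the real XML structure:

```
input {
  # One file input per combined file; separate inputs run in parallel.
  file {
    path => "/data/flattened/batch-1.lines"
    start_position => "beginning"
    sincedb_path => "/data/sincedb/batch-1"
  }
  file {
    path => "/data/flattened/batch-2.lines"
    start_position => "beginning"
    sincedb_path => "/data/sincedb/batch-2"
  }
}

filter {
  xml {
    source => "message"    # the whole one-line XML document
    store_xml => false
    # XPath expressions are guesses; adjust them to the real document structure.
    # Note that the xml filter stores XPath matches as arrays.
    xpath => {
      "/article/title/text()"    => "title"
      "/article/abstract/text()" => "abstract"
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "articles"
  }
}
```

The original line stays in the message field, which would cover the full-text requirement mentioned above.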