Hi,
This relates to my recent threads regarding XML import into Logstash. I now have my entire import set up, and it runs as expected in the test environment. However, with production files I am running into an issue caused by the sheer size of the file I want to import.
Every Friday our government releases a large XML file (currently 76 GB) which contains data on all vehicles ever registered. I am playing with this data in order to become familiar with the Elastic stack.
The XML file contains over 1,180,417,927 lines in total. I searched the forum and found a thread where Magnus Bäck suggested loading the data into a queue and having it consumed by Logstash; in that topic the file was about 2 GB in size. Unfortunately I cannot find the thread again.
Anyway, I have never worked with files of this size before (I am a web developer moving towards system integration), and I am puzzled as to what software would handle them efficiently together with the Elastic products. I initially thought Logstash would cope with files this large.
The error I get from Logstash is that the loaded message is incomplete, which means many of the registrations are cut off, and my xpath mapping then returns: "Only String and Array types are splittable. field:statistic is of type = NilClass"
I would like to avoid .NET products; Java, Ruby, or Python is preferred.
A bit more information regarding the file:
Size: ~75 GB and growing
Total lines: ~1,180,417,927 and growing
Growth: linear (one new entry per official registration)
Encoding: UTF-8
Average size per registration: ~100 lines (multiline)
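
In case it helps to see what I have in mind, here is roughly the approach I am considering, based on the queue suggestion above. This is only a sketch with assumed names: I am guessing the per-record element is called `registration`, and I am using Redis as the queue, though any broker Logstash has an input for would do. The idea is to stream-parse the file with Python's `xml.etree.ElementTree.iterparse` so memory stays flat regardless of file size, and to push each complete registration onto a list for Logstash to consume:

```python
import xml.etree.ElementTree as ET
import redis  # pip install redis -- any queue Logstash can read from would do

r = redis.Redis(host="localhost", port=6379)

# Stream-parse instead of loading the whole 76 GB document into memory.
context = ET.iterparse("vehicles.xml", events=("start", "end"))
event, root = next(context)  # grab the root element so we can clear it later

for event, elem in context:
    # "registration" is an assumed element name -- substitute the real
    # per-record tag (mind XML namespaces, which iterparse includes in tags).
    if event == "end" and elem.tag == "registration":
        # Push the complete, well-formed record onto a Redis list, so a
        # truncated event can never reach the xpath filter.
        r.rpush("vehicle_registrations", ET.tostring(elem, encoding="unicode"))
        root.clear()  # drop processed children so memory use stays constant
```

On the consuming side, Logstash's redis input (with `data_type => "list"` and the same key) should then receive one complete registration per event, and the existing xml/xpath filters would apply cleanly. I have not tested this at this scale, so I would be glad to hear whether it is a sane direction.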