Working with large files (100 GB+)


Related to my recent threads regarding XML import into Logstash: I now have my entire import set up, and it runs as expected in the test environment. However, with the production files I am running into an issue caused by the sheer size of the file I wish to import.

Every Friday our government releases a large XML file (currently 76 GB) containing data on every vehicle ever registered. I am playing with this data in order to become familiar with the Elastic suite.

The XML file contains more than 1,180,417,927 lines. I searched the forum and found a thread where Magnus Bäck suggested loading the data into a queue and having it consumed by Logstash; in that topic the file was about 2 GB. Unfortunately I cannot find the thread again.
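For what it's worth, the queue idea can be sketched in a few lines of Python: a producer thread pushes records onto a bounded queue while a consumer drains them in batches. The `<registration>` element name and the batch size of 1000 are my assumptions, not details from the original thread.

```python
# Minimal sketch of the queue approach: a producer feeds records into a
# bounded queue, a consumer drains them in batches (e.g. for bulk indexing).
# The record format and batch size are assumptions for illustration.
import queue
import threading

def producer(records, q):
    """Push records onto the queue, then signal completion with a sentinel."""
    for rec in records:
        q.put(rec)
    q.put(None)  # sentinel: no more records

def consumer(q, batch_size=1000):
    """Drain the queue into batches of batch_size records."""
    batches, batch = [], []
    while True:
        rec = q.get()
        if rec is None:
            break
        batch.append(rec)
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)
    return batches

q = queue.Queue(maxsize=10000)  # bounded, so the producer cannot outrun memory
records = [f"<registration id='{i}'/>" for i in range(2500)]
t = threading.Thread(target=producer, args=(records, q))
t.start()
batches = consumer(q, batch_size=1000)
t.join()
# 2500 records in batches of 1000 → batches of 1000, 1000 and 500
```

The bounded `maxsize` is the important part: it applies backpressure, so a fast reader never holds more than a fixed number of records in memory regardless of the total file size.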

Anyway, I have never worked with files of this size before (I am a web developer moving towards system integrations), and I am puzzled as to what software would be efficient to use together with the Elastic products. I initially assumed Logstash would handle files this large.

The error I get from Logstash is that the loaded message is incomplete, which means many of the registrations are cut off and my xpath mapping instead returns: "Only String and Array types are splittable. field:statistic is of type = NilClass"
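One way around the truncation is to never hand the whole document to the parser at all, and instead stream one complete element at a time. Here is a hedged sketch using Python's `xml.etree.ElementTree.iterparse`; the `<registration>` and `<statistic>` element names are assumptions about the file's schema.

```python
# Streaming sketch: iterparse emits each element as soon as its closing tag
# is seen, so only one registration is ever held in memory at a time.
# Tag names <registration> and <statistic> are assumed, not confirmed.
import io
import xml.etree.ElementTree as ET

def stream_registrations(source, tag="registration"):
    """Yield one complete <registration> element at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()  # free the children so memory use stays flat

# Tiny in-memory stand-in for the 76 GB file
xml_doc = io.BytesIO(
    b"<root>"
    b"<registration><statistic>a</statistic></registration>"
    b"<registration><statistic>b</statistic></registration>"
    b"</root>"
)
stats = [r.findtext("statistic") for r in stream_registrations(xml_doc)]
```

Because every yielded element is complete, a missing `<statistic>` shows up as an explicit `None` from `findtext` rather than a half-read record, which is easier to filter out before indexing.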

I would like to avoid .NET products; Java, Ruby, or Python is preferred.

A bit more information about the file:
Size: ~76 GB and growing
Total lines: ~1,180,417,927+
Growth: linear (one new entry per official registration)
Encoding: UTF-8
Average size per registration: ~100 lines (multiline)

I have written a shell script that splits the file into chunks of 1,000 registrations each. I am not sure this is the most efficient approach, but it works.
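The same splitting idea can be sketched in Python, counting closing tags rather than raw lines so multiline registrations are never cut in half. The `</registration>` end tag is an assumption about the schema, and a real version would write each chunk to its own file instead of collecting them in memory.

```python
# Sketch of chunking a multiline XML stream by registration count.
# The </registration> end tag is an assumed schema detail.
def split_into_chunks(lines, chunk_size=1000, end_tag="</registration>"):
    """Group lines into chunks of chunk_size complete registrations."""
    chunks, current, count = [], [], 0
    for line in lines:
        current.append(line)
        if end_tag in line:
            count += 1
            if count == chunk_size:
                chunks.append(current)
                current, count = [], 0
    if current:  # flush the final, possibly short, chunk
        chunks.append(current)
    return chunks

# Usage: five two-line registrations split into chunks of two registrations
lines = []
for i in range(5):
    lines += [f"<registration id='{i}'>", "</registration>"]
chunks = split_into_chunks(lines, chunk_size=2)
# → 3 chunks: 2 + 2 + 1 registrations
```

Counting end tags instead of lines is what keeps every chunk a well-formed fragment, which should avoid the incomplete-message errors from Logstash.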
