Working with large files (100 GB+)

(Michael) #1


This is related to my recent threads about importing XML into Logstash. I now have my entire import set up, and it runs as expected in the test environment. With production files, however, I am running into a problem caused by the sheer size of the file I want to import.

Every Friday our government releases a large XML file (currently 76 GB) containing data on all vehicles ever registered. I am playing with this data in order to become familiar with the Elastic suite.

The XML file contains more than 1180417927 lines. I searched the forum and found a thread where Magnus Bäck suggested loading the data into a queue and having Logstash consume it. In that thread the file was about 2 GB in size. Unfortunately I cannot find the thread again.

Anyway, I have never worked with files of this size before (I am a web developer moving towards system integration), and I am puzzled as to which software would be efficient to use together with the Elastic products. I initially thought Logstash would cope well with files this large.

The error I get with Logstash is that the loaded message is incomplete, which means that many of the registrations are cut off and my XPath mapping instead returns: "Only String and Array types are splittable. field:statistic is of type = NilClass"

I would like to avoid .NET products. Java, Ruby, or Python would be preferred.

A bit more information about the file:
Size: ~75 GB+
Total lines: ~1180417927+
Growth: Linear (one new entry per official registration)
Encoding: UTF-8
Average size per registration: ~100 lines (multiline)

(Michael) #2

I have created a shell script that splits the file into chunks of 1000 registrations. I am not sure this is the most efficient way, but it works.
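
For anyone who would rather do the splitting in Python, here is a minimal sketch of the same idea using the standard library's streaming parser, so the whole file never has to fit in memory. It is not the shell script mentioned above. The element name `Registration`, the `<chunk>` wrapper element, and the output file naming are all assumptions made for illustration; the real tag names in the government file will differ.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_records(source, out_dir, record_tag="Registration", chunk_size=1000):
    """Stream `source` and write files holding at most `chunk_size` records.

    `record_tag` is a hypothetical element name; replace it with the real one.
    If the file uses an XML namespace, the tag looks like "{uri}Registration".
    Returns the list of chunk paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    batch = []

    def flush():
        # Write the buffered records as one small, well-formed XML file.
        if not batch:
            return
        path = out_dir / f"chunk_{len(written):06d}.xml"
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("<chunk>\n")
            fh.writelines(batch)
            fh.write("\n</chunk>\n")
        written.append(path)
        batch.clear()

    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            batch.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # free the record's own subtree
            root.clear()  # detach finished children so memory use stays flat
            if len(batch) == chunk_size:
                flush()
    flush()  # write the final, possibly smaller, chunk
    return written
```

Calling `split_records("vehicles.xml", "chunks")` (both names hypothetical) would produce `chunks/chunk_000000.xml`, `chunks/chunk_000001.xml`, and so on. With roughly 1.18 billion lines at about 100 lines per registration, that is around 11.8 million registrations, so chunks of 1000 would yield on the order of 11,800 files.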
