Working with large files (100 GB+)


Related to my recent threads regarding XML import into Logstash: I now have my entire import set up, and it runs as expected in the test environment. However, with the production files I am running into an issue caused by the sheer size of the file I wish to import.

Every Friday our government releases a large XML file (currently 76 GB) containing data on every vehicle ever registered. I am playing with this data in order to become familiar with the Elastic suite.

The XML file contains more than 1,180,417,927 lines. I searched the forum and found a thread where Magnus Bäck suggested loading the data into a queue and having it consumed by Logstash; in that topic the file was about 2 GB. Unfortunately I cannot find the thread again.
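For what it's worth, the queue idea can be sketched in a few lines of Python: a producer thread pushes records onto a bounded queue while a consumer drains them in batches. The `<registration>` element name and the batch size of 1000 are my assumptions, not details from the original thread.

```python
# Minimal sketch of the queue approach: a producer feeds records into a
# bounded queue, a consumer drains them in batches (e.g. for bulk indexing).
# The record format and batch size are assumptions for illustration.
import queue
import threading

def producer(records, q):
    """Push records onto the queue, then signal completion with a sentinel."""
    for rec in records:
        q.put(rec)
    q.put(None)  # sentinel: no more records

def consumer(q, batch_size=1000):
    """Drain the queue into batches of batch_size records."""
    batches, batch = [], []
    while True:
        rec = q.get()
        if rec is None:
            break
        batch.append(rec)
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)
    return batches

q = queue.Queue(maxsize=10000)  # bounded, so the producer cannot outrun memory
records = [f"<registration id='{i}'/>" for i in range(2500)]
t = threading.Thread(target=producer, args=(records, q))
t.start()
batches = consumer(q, batch_size=1000)
t.join()
# 2500 records in batches of 1000 → batches of 1000, 1000 and 500
```

The bounded `maxsize` is the important part: it applies backpressure, so a fast reader never holds more than a fixed number of records in memory regardless of the total file size.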

Anyway, I have never worked with files of this size before (I am a web developer moving towards system integrations), and I am puzzled as to what software would be efficient to use together with the Elastic products. I initially assumed Logstash would handle files this large.

The error I get from Logstash is that the loaded message is incomplete, which means many of the registrations are cut off and my xpath mapping instead returns: "Only String and Array types are splittable. field:statistic is of type = NilClass"
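One way around the truncation is to never hand the whole document to the parser at all, and instead stream one complete element at a time. Here is a hedged sketch using Python's `xml.etree.ElementTree.iterparse`; the `<registration>` and `<statistic>` element names are assumptions about the file's schema.

```python
# Streaming sketch: iterparse emits each element as soon as its closing tag
# is seen, so only one registration is ever held in memory at a time.
# Tag names <registration> and <statistic> are assumed, not confirmed.
import io
import xml.etree.ElementTree as ET

def stream_registrations(source, tag="registration"):
    """Yield one complete <registration> element at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()  # free the children so memory use stays flat

# Tiny in-memory stand-in for the 76 GB file
xml_doc = io.BytesIO(
    b"<root>"
    b"<registration><statistic>a</statistic></registration>"
    b"<registration><statistic>b</statistic></registration>"
    b"</root>"
)
stats = [r.findtext("statistic") for r in stream_registrations(xml_doc)]
```

Because every yielded element is complete, a missing `<statistic>` shows up as an explicit `None` from `findtext` rather than a half-read record, which is easier to filter out before indexing.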

I would like to avoid .NET products; Java, Ruby, or Python is preferred.

A bit more information about the file:
Size: ~76 GB and growing
Total lines: ~1,180,417,927+
Growth: linear (one new entry per official registration)
Encoding: UTF-8
Average size per registration: ~100 lines (multiline)

I have written a shell script that splits the file into chunks of 1,000 registrations each. I am not sure this is the most efficient approach, but it works.
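The same splitting idea can be sketched in Python, counting closing tags rather than raw lines so multiline registrations are never cut in half. The `</registration>` end tag is an assumption about the schema, and a real version would write each chunk to its own file instead of collecting them in memory.

```python
# Sketch of chunking a multiline XML stream by registration count.
# The </registration> end tag is an assumed schema detail.
def split_into_chunks(lines, chunk_size=1000, end_tag="</registration>"):
    """Group lines into chunks of chunk_size complete registrations."""
    chunks, current, count = [], [], 0
    for line in lines:
        current.append(line)
        if end_tag in line:
            count += 1
            if count == chunk_size:
                chunks.append(current)
                current, count = [], 0
    if current:  # flush the final, possibly short, chunk
        chunks.append(current)
    return chunks

# Usage: five two-line registrations split into chunks of two registrations
lines = []
for i in range(5):
    lines += [f"<registration id='{i}'>", "</registration>"]
chunks = split_into_chunks(lines, chunk_size=2)
# → 3 chunks: 2 + 2 + 1 registrations
```

Counting end tags instead of lines is what keeps every chunk a well-formed fragment, which should avoid the incomplete-message errors from Logstash.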
