Command line script to bulk load .xlsx data into ES


(stratawing) #1

I have a use case where users are permitted to upload xlsx files with
up to 10,000 records per xlsx file. The xlsx file has 9 fields, 4 of
which are numbers (double), and 5 of which are variable length
strings.

Accordingly, I was messing around with importing .xlsx files into ES
using the bulk operation and, after tossing out various xml parsers
due to speed issues, came up with the following:

Using this relatively simple command line script, I was able to parse
and bulk load 10,000 records into ES in under 2 seconds -- with about
1 second of the total time coming from parsing the xlsx file.

Caution: This isn't really general purpose (i.e. - no excel date
conversion, only text and numbers, depends on (m)awk), but hopefully
it can help someone along the way.

Note that there are a bunch of xlsx parsers out there that are
probably much "safer" (see, e.g. Apache POI) but I had some spare time
to kill today.....

Cheers!


(Shay Banon) #2

Thanks for sharing!

On Monday, February 13, 2012 at 11:43 PM, stratawing wrote:

I have a use case where users are permitted to upload xlsx files with
up to 10,000 records per xlsx file. The xlsx file has 9 fields, 4 of
which are numbers (double), and 5 of which are variable length
strings.

Accordingly, I was messing around with importing .xlsx files into ES
using the bulk operation and, after tossing out various xml parsers
due to speed issues, came up with the following:

https://gist.github.com/1820634

Using this relatively simple command line script, I was able to parse
and bulk load 10,000 records into ES in under 2 seconds -- with about
1 second of the total time coming from parsing the xlsx file.

Caution: This isn't really general purpose (i.e. - no excel date
conversion, only text and numbers, depends on (m)awk), but hopefully
it can help someone along the way.

Note that there are a bunch of xlsx parsers out there that are
probably much "safer" (see, e.g. Apache POI) but I had some spare time
to kill today.....

Cheers!


(system) #3