I have a use case where users are permitted to upload xlsx files with
up to 10,000 records per xlsx file. The xlsx file has 9 fields, 4 of
which are numbers (double), and 5 of which are variable length
Accordingly, I was messing around with importing .xlsx files into ES
using the bulk operation and, after tossing out various xml parsers
due to speed issues, came up with the following:
Using this relatively simple command line script, I was able to parse
and bulk load 10,000 records into ES in under 2 seconds -- with about
1 second of the total time coming from parsing the xlsx file.
Caution: This isn't really general purpose (i.e. - no excel date
conversion, only text and numbers, depends on (m)awk), but hopefully
it can help someone along the way.
Note that there are a bunch of xlsx parsers out there that are
probably much "safer" (see, e.g. Apache POI) but I had some spare time
to kill today.....