I have a use case where users are permitted to upload xlsx files with
up to 10,000 records per xlsx file. The xlsx file has 9 fields, 4 of
which are numbers (double), and 5 of which are variable length
strings.
Accordingly, I was messing around with importing .xlsx files into ES
using the bulk operation and, after tossing out various xml parsers
due to speed issues, came up with the following:
Using this relatively simple command line script, I was able to parse
and bulk load 10,000 records into ES in under 2 seconds -- with about
1 second of the total time coming from parsing the xlsx file.
Caution: This isn't really general purpose (i.e., no Excel date
conversion, only text and numbers, and it depends on (m)awk), but hopefully
it can help someone along the way.
Note that there are a bunch of xlsx parsers out there that are
probably much "safer" (see, e.g., Apache POI), but I had some spare time
to kill today.
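For anyone who wants a rough picture of the approach, here is a minimal sketch in Python. It is not the awk-based command line script described above (that script is not reproduced here), and it makes no claim to match its speed. It assumes the data lives in `xl/worksheets/sheet1.xml`, that every cell in a row is present (no skipped blanks), and that cells are either shared strings or plain numbers, so it has the same limitations: no date conversion, only text and numbers. The function names, the `uploads` index name, and the localhost URL are all made up for illustration.

```python
import json
import urllib.request
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used by the worksheet and shared-strings parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}
TAG = "{%s}" % NS["m"]


def read_rows(xlsx, sheet="xl/worksheets/sheet1.xml"):
    """Yield each row as a list of str/float values (path or file object)."""
    with zipfile.ZipFile(xlsx) as zf:
        shared = []
        if "xl/sharedStrings.xml" in zf.namelist():
            root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            for si in root.findall("m:si", NS):
                # A string item may be split across several <t> runs.
                shared.append("".join(t.text or "" for t in si.iter(TAG + "t")))
        sheet_root = ET.fromstring(zf.read(sheet))
    for row in sheet_root.iter(TAG + "row"):
        values = []
        for cell in row.findall("m:c", NS):
            v = cell.find("m:v", NS)
            if v is None or v.text is None:
                values.append(None)
            elif cell.get("t") == "s":
                values.append(shared[int(v.text)])  # index into shared strings
            else:
                values.append(float(v.text))        # assumed numeric
        yield values


def bulk_body(rows, fields, index="uploads"):
    """Build a newline-delimited bulk body: one action line, one source line."""
    lines = []
    for values in rows:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(dict(zip(fields, values))))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline


def post_bulk(body, url="http://localhost:9200/_bulk"):
    """POST the body to a (hypothetical) local node and return the parsed reply."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

If the first row of the sheet holds the column names, usage would look something like `rows = read_rows("upload.xlsx"); fields = next(rows); post_bulk(bulk_body(rows, fields))`. Older ES versions also expect a `_type` in the action line, so adjust that to whatever your cluster needs.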
Originally posted by stratawing on Monday, February 13, 2012 at 11:43 PM.