Bulk API via S3

(Jonathan Spooner) #1

I'm using Spark to build text files for the Bulk API, and I've staged all 80GB of them in S3. I have 45GB of data, and I just realized I can't use curl to POST a file that lives in S3!

A) I can write a bash script to download the file and POST it to ES. This is not optimal, and I can't ssh into one of the hosted ES nodes and run it from there.

B) I could stream the data with Lambda, but the data is already in bulk format and I don't want to stream at this time.

I want to bulk load my data set and play with different ES configurations to find the correct cluster size.

I know there are many alternatives but I'd like to see if it's possible to use the bulk api from S3.

(Christian Dahlqvist) #2

You typically want to keep each bulk request to a few MB or smaller, so sending a huge bulk file in a single request is not recommended. Creating a script to read the files and break them up into smaller chunks is one option. You could also use Logstash, as it has an S3 input plugin, although you would need to process the data and transform it back into simple records, because the Elasticsearch output builds bulk requests internally.
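
As a rough illustration of the chunking option, here is a minimal Python sketch. It assumes every action line in the file is followed by exactly one source line (index/create actions only, no deletes), so lines can be kept together in pairs. The names chunk_bulk_lines and post_bulk, the 5 MB default, and the localhost URL in the usage note are all hypothetical choices for the example, not anything from Elasticsearch itself.

```python
import urllib.request
from typing import Iterable, Iterator


def chunk_bulk_lines(lines: Iterable[str],
                     max_bytes: int = 5 * 1024 * 1024) -> Iterator[str]:
    """Yield bulk request bodies of at most roughly max_bytes each.

    Assumption: each action line is followed by exactly one source line.
    A single pair larger than max_bytes is still sent as its own chunk.
    """
    batch, size = [], 0
    # Strip newlines and skip blank lines so pairs stay aligned.
    stripped = (line.rstrip("\n") for line in lines)
    it = (line for line in stripped if line)
    for action in it:
        source = next(it, None)
        if source is None:
            raise ValueError("action line without a matching source line")
        pair = action + "\n" + source + "\n"
        pair_size = len(pair.encode("utf-8"))
        # Flush the current batch before it would exceed the limit.
        if batch and size + pair_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(pair)
        size += pair_size
    if batch:
        yield "".join(batch)


def post_bulk(url: str, body: str) -> bytes:
    """POST one chunk to a _bulk endpoint as newline-delimited JSON."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Usage would look something like `for body in chunk_bulk_lines(open("part-00000")): post_bulk("http://localhost:9200/_bulk", body)`, with the file first downloaded from S3 (or streamed line by line with an S3 client of your choice). The response of each request should still be checked for per-item errors.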
