Bulk API via S3

Jonathan_Spooner · February 27, 2016, 7:46pm

I'm using Spark to build text files for the Bulk API, and I staged all 80GB of them in S3. I have 45GB of data, and I just realized I can't use curl to POST a file that is on S3!

A) I can write a bash script to download the file and POST it to ES. This is not optimal, and I can't ssh into one of the hosted ES nodes and run it from there.

B) I could stream the data with Lambda, but the data is already in bulk format and I don't want to stream at this time.

I want to bulk load my data set and play with different ES configurations to find the correct cluster size.

I know there are many alternatives, but I'd like to see if it's possible to use the Bulk API from S3.

Christian_Dahlqvist · February 28, 2016, 3:21am

You typically want to keep your bulk requests to a few MB or smaller, so sending a huge bulk file in a single request is not recommended. Creating a script to read the files and break them up into smaller chunks is one option. You could also use Logstash, as it has an S3 input plugin, although you would need to process the data and transform it back into simple records, because the Elasticsearch output builds bulk requests internally.
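
As a rough illustration of the "break them up into smaller chunks" option, here is a minimal Python sketch. It assumes every action line in the file is followed by a source line (as with index or create actions; delete actions, which have no source line, are not handled). The function names, the 5 MB default, and the `http://localhost:9200/_bulk` URL are placeholders chosen for the example, not anything from this thread.

```python
import urllib.request
from typing import Iterable, Iterator, List


def chunk_bulk_lines(lines: Iterable[str],
                     max_bytes: int = 5 * 1024 * 1024) -> Iterator[str]:
    """Yield bulk request bodies of at most about max_bytes each.

    Assumes each action line is followed by exactly one source line,
    and never splits such a pair across two request bodies.
    """
    batch: List[str] = []
    size = 0
    # Drop blank lines and trailing newlines so pairs can be rebuilt cleanly.
    it = (ln.rstrip("\n") for ln in lines if ln.strip())
    for action in it:
        source = next(it, None)
        if source is None:
            raise ValueError("action line without a matching source line")
        pair = action + "\n" + source + "\n"
        pair_size = len(pair.encode("utf-8"))
        # Flush the current batch before it would exceed the size limit.
        if batch and size + pair_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(pair)
        size += pair_size
    if batch:
        yield "".join(batch)


def post_bulk(body: str, url: str = "http://localhost:9200/_bulk") -> bytes:
    """POST one bulk body; the URL is a placeholder for your cluster."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Any iterator of lines works as input, for example a file downloaded from S3 and opened locally: `for body in chunk_bulk_lines(open("part-00000")): post_bulk(body)`. The bulk response should still be checked for per-item errors, which this sketch leaves out.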
