Is there any way to bulk-load huge data into ES without REST?

Hi, thanks for your reply, but I don't know why "actions >= 4096"?
On Sunday, June 30, 2013 at 8:11:40 AM UTC+8, InquiringMind wrote:

Ok, Dancer. Here's how you can bulk-load in Java faster and more reliably
than you can possibly imagine. Faster than any other database engine I've
ever used or seen. This text was written back when I was using ES 0.20.4
but it got even faster with version 0.90.0. The result: 90 million
documents in 2 hours and 41 minutes. (And again, note that it gets
measurably faster when using 0.90)

The following is a very skeletal form of the Java-based bulk request
builder. I originally based it on one of Shay's examples. Of course, I've
added much more extensive error checking and statistics tracking to my
production version. But this is enough to give you the basic idea. I never
use curl for bulk loading anymore; doing it in Java is vastly better, there
are no curl limitations to work around, and the statistics are so much
better and more useful during huge testing runs of nearly 100 million
documents.

// Create transport client: Settings should specify cluster.name at least
TransportClient client = new TransportClient(clientSettings);

// Add at least one address
InetSocketTransportAddress serverAddress =
    new InetSocketTransportAddress(hostName, port);
client.addTransportAddress(serverAddress);

// Create initial bulk request builder
BulkRequestBuilder bulkRequest = client.prepareBulk();
bulkRequest.setRefresh(false);

boolean last = false;
for (;;)
{
    // Get next line: action-and-meta-data
    // Get next line: source

    if (EOF)
    {
        last = true;
    }
    else
    {
        // Call either prepareIndex (for create and index actions)
        // or prepareDelete (for delete actions) and set up the
        // resulting object as required

        // Add the properly set-up object to the bulk request builder
        bulkRequest.add(resultingObject);
    }

    // If our bulk limit is reached, or if we are at the end of the input
    // and some actions remain: send the accumulated action requests to ES.
    // (4096 is simply the batch size chosen for these runs; tune as needed.)
    int actions = bulkRequest.numberOfActions();
    if ((actions >= 4096) || (last && actions != 0))
    {
        BulkResponse bulkResponse = bulkRequest.execute().actionGet();

        // Handle failures (have only seen them during testing)
        if (bulkResponse.hasFailures())
        {
            for (BulkItemResponse item : bulkResponse.items())
            {
                // Log errors; limit them to about 128 or so
                // to keep from flooding the logs if the entire bulk
                // input is bad for some reason (since I write the
                // converters, failures never happen!!!)
            }
        }

        // Create a new BulkRequestBuilder for the next iteration
        // (reassign, don't redeclare, the variable)
        bulkRequest = client.prepareBulk();
        bulkRequest.setRefresh(false);
    }

    // Break outside the flush block so the loop also ends when the
    // input finishes exactly on a batch boundary (actions == 0)
    if (last)
        break;
}
// Document visibility is handled by the index refresh interval; see below.
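To make the resultingObject placeholder concrete, here is a minimal sketch of one index action. The index name "myindex", type "mytype", and the docId and sourceJsonLine variables are hypothetical illustrations, not part of the original skeleton:

// Sketch only: build one index action from a parsed action-and-meta-data
// line plus its JSON source line (all names below are hypothetical)
IndexRequestBuilder indexAction = client
    .prepareIndex("myindex", "mytype", docId)  // from action-and-meta-data
    .setSource(sourceJsonLine);                // the raw JSON source line

bulkRequest.add(indexAction);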

The index is configured with a 1s refresh. But during the bulk load, the
refresh rate is temporarily changed to 120s (as per a recommendation).
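For reference, here is a minimal sketch of toggling that setting through the same Java client, assuming the 0.90-era update-settings API; the index name "myindex" is an assumption for illustration:

import org.elasticsearch.common.settings.ImmutableSettings;

// Relax the refresh interval before the bulk load ("myindex" is hypothetical)
client.admin().indices().prepareUpdateSettings("myindex")
    .setSettings(ImmutableSettings.settingsBuilder()
        .put("index.refresh_interval", "120s"))
    .execute().actionGet();

// ... run the bulk load ...

// Restore the normal 1s refresh afterward
client.admin().indices().prepareUpdateSettings("myindex")
    .setSettings(ImmutableSettings.settingsBuilder()
        .put("index.refresh_interval", "1s"))
    .execute().actionGet();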

Previously, I had configured 16 shards for this index. But when build
times started climbing to nearly 4 hours (due partly to the use of the
asciifolding token filter for all of the English-language string fields), I
started looking at the shard count. And I noticed that the Elasticsearch
Head interface has seriously ugly alignment issues when it tries to display
shard IDs in double digits (10 through 15). So I wondered whether 99.9% of
Elasticsearch users specify fewer than 11 shards, and I tried an experiment
with 10 shards (IDs 0 through 9).
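Since the shard count is fixed at index-creation time, here is a minimal sketch of setting it through the Java client; the index name "myindex" is again an assumption:

// Create the index with 10 shards ("myindex" is hypothetical)
client.admin().indices().prepareCreate("myindex")
    .setSettings(ImmutableSettings.settingsBuilder()
        .put("index.number_of_shards", 10))
    .execute().actionGet();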

A serial conversion to JSON and bulk loading of 90 million records with
all index actions (some duplicates) and no delete actions now takes 2:41 (2
hours and 41 minutes). Awesome!

I had thought that Elasticsearch was slowing down as the previous runs
progressed, so I also added the ability to track the counts in each
15-minute window during the build (the size of the window is configurable,
of course!). A sketch of this follows below.
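Here is a minimal sketch of such a windowed counter; the class and field names are hypothetical, and only the running total is shown (the production version also tracks create, index, and delete counts separately):

// Sketch of a per-window throughput counter (all names are hypothetical)
class WindowStats
{
    private final long windowMillis;             // e.g. 15 * 60 * 1000L
    private long windowStart = System.currentTimeMillis();
    private long windowTotal = 0;                // actions in this window
    private long runningTotal = 0;               // actions since start

    WindowStats(long windowMillis) { this.windowMillis = windowMillis; }

    void record(int actions)
    {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis)
        {
            // Timestamp formatting simplified for the sketch
            System.out.println("AT " + new java.util.Date(now)
                + " :: WINDOW: Total=" + windowTotal
                + " CURRENT: Total=" + runningTotal);
            windowTotal = 0;
            windowStart = now;
        }
        windowTotal += actions;
        runningTotal += actions;
    }
}

Call record(actions) after each successful bulk execute(), and print a final summary when the load completes.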

Starting: 4096 per bulk-load action
Running totals to be shown at every 15m interval
AT 2013-03-25T21:54:05.537Z :: WINDOW: Total=0 , create=0 , index=0 , delete=0 CURRENT: Total=0
AT 2013-03-25T22:09:05.557Z :: WINDOW: Total=14995457 , create=0 , index=14995457 , delete=0 CURRENT: Total=14995457
AT 2013-03-25T22:24:05.615Z :: WINDOW: Total=13635584 , create=0 , index=13635584 , delete=0 CURRENT: Total=28631041
AT 2013-03-25T22:39:05.792Z :: WINDOW: Total=13197312 , create=0 , index=13197312 , delete=0 CURRENT: Total=41828353
AT 2013-03-25T22:54:05.793Z :: WINDOW: Total=12587184 , create=0 , index=12587184 , delete=0 CURRENT: Total=54415537
AT 2013-03-25T23:09:06.677Z :: WINDOW: Total=6508368 , create=0 , index=6508368 , delete=0 CURRENT: Total=60923905
AT 2013-03-25T23:24:07.210Z :: WINDOW: Total=3436544 , create=0 , index=3436544 , delete=0 CURRENT: Total=64360449
AT 2013-03-25T23:39:07.288Z :: WINDOW: Total=3383296 , create=0 , index=3383296 , delete=0 CURRENT: Total=67743745
AT 2013-03-25T23:54:07.443Z :: WINDOW: Total=3407872 , create=0 , index=3407872 , delete=0 CURRENT: Total=71151617
AT 2013-03-26T00:09:08.337Z :: WINDOW: Total=3809280 , create=0 , index=3809280 , delete=0 CURRENT: Total=74960897
AT 2013-03-26T00:24:08.676Z :: WINDOW: Total=7581696 , create=0 , index=7581696 , delete=0 CURRENT: Total=82542593
AT 2013-03-26T00:35:10.585Z :: WINDOW: Total=7924767 , create=0 , index=7924767 , delete=0 CURRENT: Total=90467360

SUMMARY: { Total=90467360 , create=0 , index=90467360 , delete=0 }

Done: 90467360 documents in 9665.03312 seconds (02:41:05.033): 9360.274183933681 documents/second

Again, this is all done using one index so I don't need to route updates
based on the index and can just pump them through to ES. This may not be
the best strategy, but it is pushing ES in a direction that I never thought
a database could go when running just on my little old laptop with 8 GB
RAM, quad-core i7, and one relatively slow disk that is both reading the
input data and writing the ES database.

And currently using ES version 0.20.4 with Java 6 (yeah, I know. But
that's out of my control at the moment). However, it still works great! Up
to 54M documents, I was getting an index rate of about 14K documents per
second; for the full 90 million load it averages to a respectable rate of
just over 9K documents per second. Ad-hoc query times seem to be better
with only 10 shards than with the 16 I had been using. And query-by-id is
still stellar.

Brian

On Friday, June 28, 2013 12:56:41 AM UTC-4, dancer wrote:

I want to bulk-load huge data into ES, but the current plugin uses REST,
which I think may not be very good for performance. So, is there any other
way, such as writing Lucene indexes directly using MapReduce?
