Indexing Performance, Threads + Bulk Size

davrob · January 17, 2011, 9:34am

Hi,

I've been indexing about 90,000 documents in about 8 minutes. Is that
good performance for elastic search? I think other people have got
much better. I've based my indexing on the wikipedia river. Is it
possible to do the following:

Is bulk updates the way to go for performance or is iterating
though individual saves as fast?
Can I increase the bulk size from 100 to 1000 or 10,000?
Will several concurrent thread adding bulk updates to the same
master node help performance?

thanks, my current update code is below.

David.

  public Boolean bulkSave(JsonBean jsonBean, BulkRequestBuilder

currentRequest) {

            IndexRequest indexOperation =

Requests.indexRequest(indexName)
.type(jsonBean.getIndexType())
.id("" + jsonBean.getId())
.create(false)
.source(jsonBean.getJson());

			currentRequest.add(indexOperation);
            Boolean createNew =

processBulkIfNeeded(currentRequest, processedAccount++);
return createNew;

    }

    private Boolean processBulkIfNeeded(BulkRequestBuilder

currentRequest, int processedAccounts) {
Boolean createNewBulkReq = false;
if (currentRequest.numberOfActions() >= bulkSize) {
// execute the bulk operation
int currentOnGoingBulks =
onGoingBulks.incrementAndGet();
LOG.info("Ongoing Bulks = " + currentOnGoingBulks + "
processed Accounts = " + processedAccounts );
if (currentOnGoingBulks > dropThreshold) {
// TODO, just wait here!, we can slow down the
parsing
onGoingBulks.decrementAndGet();
LOG.error("dropping bulk, " + onGoingBulks + "
crossed threshold " + dropThreshold + " processed Accounts = " +
processedAccounts);
} else {
try {
final int bulkNo = onGoingBulks.get();
LOG.info("Executing Bulk Request " + bulkNo );
currentRequest.execute(new
ActionListener() {
@Override public void
onResponse(BulkResponse bulkResponse) {
onGoingBulks.decrementAndGet();
LOG.info("Bulk [" + bulkNo + "]
Executed");
}

                        @Override public void onFailure(Throwable

e) {
LOG.error("Error in Bulk [" + bulkNo +
"]", e);
}
});
} catch (Exception e) {
LOG.error("failed to process bulk", e);
}
}
// once we have executed a bulk request, create a new
one for adding
// a fresh set of bulk updates, deletes, additions
etc.
createNewBulkReq = true;

        }
		return createNewBulkReq;
    }

kimchy · January 17, 2011, 5:13pm

I don't read code on mails (formatting is not kept), if you want, you can gist it and I can have a look.

Indexing speed depends a lot on the document you index, and how big it is. There is a big difference between indexing tweets to indexing attachments.

You can certainly use more than one thread doing the bulk indexing, it will improve the indexing performance. As to the size of the batch, ES will need to represent the batch request in memory, so you need to make sure not to concurrently send too much data over the network or have the JVM run out of memory.

-shay.banon
On Monday, January 17, 2011 at 11:34 AM, dalesrob wrote:

Hi,

I've been indexing about 90,000 documents in about 8 minutes. Is that
good performance for Elasticsearch? I think other people have got
much better. I've based my indexing on the wikipedia river. Is it
possible to do the following:

Is bulk updates the way to go for performance or is iterating
though individual saves as fast?

Can I increase the bulk size from 100 to 1000 or 10,000?

Will several concurrent thread adding bulk updates to the same
master node help performance?

thanks, my current update code is below.

David.

public Boolean bulkSave(JsonBean jsonBean, BulkRequestBuilder
currentRequest) {

IndexRequest indexOperation =
Requests.indexRequest(indexName)
.type(jsonBean.getIndexType())
.id("" + jsonBean.getId())
.create(false)
.source(jsonBean.getJson());

currentRequest.add(indexOperation);
Boolean createNew =
processBulkIfNeeded(currentRequest, processedAccount++);
return createNew;

}

private Boolean processBulkIfNeeded(BulkRequestBuilder
currentRequest, int processedAccounts) {
Boolean createNewBulkReq = false;
if (currentRequest.numberOfActions() >= bulkSize) {
// execute the bulk operation
int currentOnGoingBulks =
onGoingBulks.incrementAndGet();
LOG.info("Ongoing Bulks = " + currentOnGoingBulks + "
processed Accounts = " + processedAccounts );
if (currentOnGoingBulks > dropThreshold) {
// TODO, just wait here!, we can slow down the
parsing
onGoingBulks.decrementAndGet();
LOG.error("dropping bulk, " + onGoingBulks + "
crossed threshold " + dropThreshold + " processed Accounts = " +
processedAccounts);
} else {
try {
final int bulkNo = onGoingBulks.get();
LOG.info("Executing Bulk Request " + bulkNo );
currentRequest.execute(new
ActionListener() {
@Override public void
onResponse(BulkResponse bulkResponse) {
onGoingBulks.decrementAndGet();
LOG.info("Bulk [" + bulkNo + "]
Executed");
}

@Override public void onFailure(Throwable
e) {
LOG.error("Error in Bulk [" + bulkNo +
"]", e);
}
});
} catch (Exception e) {
LOG.error("failed to process bulk", e);
}
}
// once we have executed a bulk request, create a new
one for adding
// a fresh set of bulk updates, deletes, additions
etc.
createNewBulkReq = true;

}
return createNewBulkReq;
}

Topic		Replies	Views
Questions --- Regarding to Size of Bulk Load., Capasity of a Shards and Performance Elasticsearch	3	366	July 6, 2017
Update/Upsert Performance Improvements Elasticsearch	8	9488	July 5, 2017
Indexing performance using transport client Elasticsearch	16	1410	July 31, 2017
Bulk throughput issues Elasticsearch	15	1674	July 6, 2017
Bulk Indexing Rate Elasticsearch	4	552	April 18, 2018

Indexing Performance, Threads + Bulk Size

Related topics