Bulk inserting is slow

Thanks to both of you, I'll look at this immediately!

On Tuesday, June 24, 2014 at 5:51:04 PM UTC+2, Jörg Prante wrote:

You should use the org.elasticsearch.action.bulk.BulkProcessor helper
class for concurrent bulk indexing.
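
A minimal sketch of what that looks like with the 1.x Java API (the 5000/4 values are just examples, and buildSource() is a stand-in for your per-record XContentBuilder loop):

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;

BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
    @Override
    public void beforeBulk(long executionId, BulkRequest request) {
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
        // log any per-document failures of the finished bulk
        if (response.hasFailures()) {
            logger.log(Level.SEVERE, response.buildFailureMessage());
        }
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
        logger.log(Level.SEVERE, "Bulk request failed", failure);
    }
})
        .setBulkActions(5000)       // flush a bulk every 5,000 docs
        .setConcurrentRequests(4)   // allow up to 4 bulks in flight at once
        .build();

// feed it from the iterator; batching and concurrency are handled for you
while (recordsIterator.hasNext()) {
    bulkProcessor.add(client.prepareIndex(datasetName, RECORD)
            .setSource(buildSource(recordsIterator.next())).request());
}
bulkProcessor.close();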

Jörg

On Tue, Jun 24, 2014 at 5:34 PM, Frederic Esnault <esnault....@gmail.com> wrote:

Hi again,

Any idea how to parallelize the bulk insert process?
I tried creating 4 BulkInserters extending RecursiveAction and executing
them all, but the result was awful: three of them finished very slowly, one
never finished (I don't know why), and I ended up with only 70K docs in ES
instead of 265,000...

Downsizing the batch size to 10,000 did not change much; the total process
took roughly one second less. (The absolute time here is much lower than in
my previous post because I moved the importing UI to my server, close to
one of the ES nodes.) It was over 29 seconds before; now it's 28.

Import CSV file took 28.069 seconds

Here is the insertion code. The iterator is a CSV-reading iterator that
parses lines and returns Record instances (objects with generic values,
indexed as strings). MAX_RECORDS is my batch size, set to 10,000.

public void insert(Iterator<Record> recordsIterator) {
    while (recordsIterator.hasNext()) {
        batchInsert(recordsIterator, MAX_RECORDS);
    }
}

private void batchInsert(Iterator<Record> recordsIterator, int limit) {
    BulkRequestBuilder bulkRequest = client.prepareBulk();
    int processed = 0;
    try {
        logger.log(Level.INFO, "Adding records to bulk insert batch");
        while (recordsIterator.hasNext() && processed < limit) {
            processed++;
            Record record = recordsIterator.next();
            IndexRequestBuilder builder = client.prepareIndex(datasetName, RECORD);
            // build the JSON source for one record, column by column
            XContentBuilder data = jsonBuilder();
            data.startObject();
            for (ColumnMetadata column : dataset.getMetadata().getColumns()) {
                Object value = record.getCell(column.getName()).getValue();
                // treat the literal string "NULL" as a null value
                if (value == null || (value instanceof String && value.equals("NULL"))) {
                    value = null;
                }
                data.field(column.getNormalizedName(), value);
            }
            data.endObject();
            builder.setSource(data);
            bulkRequest.add(builder);
        }
        logger.log(Level.INFO, "Added " + bulkRequest.numberOfActions()
                + " records to bulk insert batch. Inserting batch...");
        long current = System.currentTimeMillis();
        // blocks until the whole bulk request has been processed
        BulkResponse bulkResponse = bulkRequest
                .setConsistencyLevel(WriteConsistencyLevel.ONE)
                .execute().actionGet();
        if (bulkResponse.hasFailures()) {
            logger.log(Level.SEVERE, "Could not index: "
                    + bulkResponse.buildFailureMessage());
        }
        System.out.println(String.format("Bulk insert took %s seconds",
                NumberUtils.formatSeconds(
                        ((double) (System.currentTimeMillis() - current)) / 1000.0)));
    } catch (Exception e) {
        e.printStackTrace();
    }
}

On Tuesday, June 24, 2014 at 1:44:03 PM UTC+2, Frederic Esnault wrote:

Thanks for all this.

I changed my conf: removed all the thread pool config, reduced the refresh
time to 5s per Michael's advice, and limited my batches to 10,000.
I'll see how that works, then I'll parallelize the bulk insert.
I'll tell you how it ends up.

Thanks again !

On Monday, June 23, 2014 at 12:56:14 PM UTC+2, Jörg Prante wrote:

Your bulk insert size is too large. It makes no sense to insert 100,000
docs with one request. Use 1,000-10,000 instead.

Also, you should submit bulk requests in parallel, not sequentially as you
do now. Sequential bulk is slow if the client CPU/network is not saturated.
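
For example, a rough sketch of parallel submission: a fixed pool of 4 threads, each sending one pre-built bulk of JSON sources. The jsonChunks variable and the "myindex"/"record" names are placeholders, not your code:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

ExecutorService pool = Executors.newFixedThreadPool(4);
for (final List<String> chunk : jsonChunks) {   // e.g. 5,000 JSON docs per chunk
    pool.submit(new Runnable() {
        @Override
        public void run() {
            BulkRequestBuilder bulk = client.prepareBulk();
            for (String json : chunk) {
                bulk.add(client.prepareIndex("myindex", "record").setSource(json));
            }
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
        }
    });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);   // wait for all bulks to complete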

Check whether you have disabled the index refresh (set it from the default
1s to -1) while bulk indexing is active. A 30s refresh interval does not
make much sense if you can execute the bulk within that time.
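
With the 1.x Java API that is something like the following ("myindex" is a placeholder):

// disable refresh before the bulk load...
client.admin().indices().prepareUpdateSettings("myindex")
        .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.refresh_interval", "-1"))
        .execute().actionGet();

// ...run the bulk indexing...

// ...then restore it, so the docs become searchable again
client.admin().indices().prepareUpdateSettings("myindex")
        .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.refresh_interval", "1s"))
        .execute().actionGet();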

Do not limit indexing memory to 50%.

It makes no sense to increase queue_size for the bulk thread pool to 1000.
That means you want a single ES node to accept 1000 x 100,000 =
100,000,000 = 100m docs at once. This would simply exceed all reasonable
limits and bring the node down with an OOM (if you really had 100m docs).

More advice is possible if you can show the client code you use to push
docs to ES.

Jörg

On Mon, Jun 23, 2014 at 12:30 PM, Frederic Esnault <esnault....@gmail.com> wrote:

Hi everyone,

I'm inserting around 265,000 documents into an Elasticsearch cluster
composed of 3 nodes (real servers).
On two servers I give Elasticsearch 20g of heap; on the third, which has
64g of RAM, I set 30g of heap for Elasticsearch.

I set the Elasticsearch configuration to:

  • 3 shards (1 per server)
  • 0 replicas
  • discovery.zen.ping.multicast.enabled: false (giving each node the
    unicast hostnames of the two other nodes)
  • and this:

indices.memory.index_buffer_size: 50%
index.refresh_interval: 30s
threadpool:
  index:
    type: fixed
    size: 30
    queue_size: 1000
  bulk:
    queue_size: 1000
  bulk:
    type: fixed
    size: 30
    queue_size: 1000
  search:
    type: fixed
    size: 100
    queue_size: 200
  get:
    type: fixed
    size: 100
    queue_size: 200

Indexing is done in groups of 100,000 docs, and here is my application log:

INFO: Adding records to bulk insert batch
INFO: Added 100000 records to bulk insert batch. Inserting batch...
-- Bulk insert took 38.724 seconds
INFO: Adding records to bulk insert batch
INFO: Added 100000 records to bulk insert batch. Inserting batch...
-- Bulk insert took 31.134 seconds
INFO: Adding records to bulk insert batch
INFO: Added 64201 records to bulk insert batch. Inserting batch...
-- Bulk insert took 17.366 seconds

--- Import CSV file took 108.905 seconds ---

I'm wondering whether this time is reasonable, or whether there is
something I can do to improve performance?


