Bulk insertion taking a long time and throwing lots of errors

Hi there. Well, getting my feet wet with ES. I'm doing a small POC here,
and I created an index with the default shard/replica values (did not touch
anything). The only thing I changed was the analyzer:

"albums" : {
"settings" : {
"index.analysis.analyzer.text_en.filter.2" : "porterStem",
"index.analysis.analyzer.text_en.filter.1" : "lowercase",
"index.analysis.analyzer.text_en.tokenizer" : "standard",
"index.analysis.analyzer.text_en.type" : "custom",
"index.analysis.analyzer.text_en.filter.0" : "standard",
"index.number_of_shards" : "5",
"index.number_of_replicas" : "1",
"index.version.created" : "190899"
}
}

So I'm trying to index a small number of documents using a BulkRequest,
but it's taking forever (almost 1 min). And when it finishes, most of the
requests have failed: I got only 376 docs indexed out of 1000, and a lot
of UnavailableShardsException[[albums][2] [2] shardIt, [0] active : Timeout
waiting for [1m], request:
org.elasticsearch.action.bulk.BulkShardRequest@db766c1].

I would imagine this is related to the shard config. I'm only running
a single instance for now.

The code for the bulk load:

while (rs.next() && count < max) {
    count++;
    XContentBuilder content = createContent(rs);
    bulkRequest.add(client.prepareIndex("albums", "album").setSource(content));

    if (count % 1000 == 0) {
        System.out.println("commit");
        BulkResponse bulkResponse = bulkRequest.execute().actionGet();
        if (bulkResponse.hasFailures()) {
            for (BulkItemResponse r : bulkResponse.items()) {
                System.out.println(r.getFailureMessage());
            }
            System.out.println("ERROR");
        }
    }
}

How can I make this faster for a single-node POC?

Regards

After every bulk insert, re-initialize the bulkRequest variable:

bulkRequest = client.prepareBulk();

Also (see the sketch below):

Set the refresh interval to -1
Set the number of replicas to 0
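
For example, with the 0.19-era Java API the settings change could look roughly like this (a rough sketch, not tested; "albums" is the index name from your post, and it needs org.elasticsearch.common.settings.ImmutableSettings):

// Disable automatic refresh and replicas for the duration of the bulk load.
client.admin().indices().prepareUpdateSettings("albums")
    .setSettings(ImmutableSettings.settingsBuilder()
        .put("index.refresh_interval", "-1")
        .put("index.number_of_replicas", 0)
        .build())
    .execute().actionGet();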

Let me know the results!

Thanks,
Shoeb :slight_smile:

Hi! Fixed this by changing the shard/replica values. Still a bit slower
when compared to the Solr DataImportHandler.

Besides that, tons of errors about duplicated documents are arising, which
makes no sense, as I'm not sending duplicate documents. The only thing I can
think of is that ids are being duplicated (I'm not sending ids; instead, I'm
relying on automatic id generation).

Still not getting why.

Any ideas?

Regards


Well, the problem was the automatic generation of ids. I would really like to
know why; I was expecting it to be safe :frowning:

One last thing: the indexing process is now running, but again, it is taking a
long, long time. Solr indexed the same data (1.7M records) in under 3 min.
I wonder why ES is taking so long. So far I'm sending a bulk request for
every 10k docs read from the DB, and each takes around 25 sec to be indexed.

Any ideas on how to boost performance?

Regards


Hi Vinicius,

If you want to insert a huge number of docs in one go, it might be worth
disabling the automatic refresh, inserting all the docs (during that time you
won't see them in searches), and then enabling the refresh again.
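
In the Java API that could look roughly like this (a sketch; "albums" is the index from the earlier posts, "1s" is just the default refresh interval, and it needs org.elasticsearch.common.settings.ImmutableSettings):

// Before the bulk load: turn off automatic refresh.
client.admin().indices().prepareUpdateSettings("albums")
    .setSettings(ImmutableSettings.settingsBuilder()
        .put("index.refresh_interval", "-1").build())
    .execute().actionGet();

// ... run all the bulk requests here ...

// After the bulk load: restore the refresh interval and refresh once explicitly
// so the documents become visible to searches.
client.admin().indices().prepareUpdateSettings("albums")
    .setSettings(ImmutableSettings.settingsBuilder()
        .put("index.refresh_interval", "1s").build())
    .execute().actionGet();
client.admin().indices().prepareRefresh("albums").execute().actionGet();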

Take a look here, towards the end, where it says "Bulk indexing usage":


From your loop code, it is not obvious whether you clear the bulk request
after the bulk response comes back. If you do not clear it, you will send
already-sent docs again and again.
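
For example, something like this inside the loop (a sketch based on the code from the first post):

if (count % 1000 == 0) {
    BulkResponse bulkResponse = bulkRequest.execute().actionGet();
    if (bulkResponse.hasFailures()) {
        for (BulkItemResponse r : bulkResponse.items()) {
            System.out.println(r.getFailureMessage());
        }
    }
    // Start a fresh request; otherwise the next execute() re-sends everything added so far.
    bulkRequest = client.prepareBulk();
}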

'UnavailableShardsException' means Elasticsearch could not find a shard that
you addressed while indexing.

From your configuration it seems you are on a single node with a replica
setting of 1, which means your cluster state is yellow. Please take care of
this. You will be able to index on a single node without errors with a
replica setting of 0 and a cluster state of green.
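
For example, after dropping the replica count to 0 (as in the sketch further up in the thread), you can block until the cluster reports green before indexing; a rough sketch:

client.admin().cluster().prepareHealth("albums")
    .setWaitForGreenStatus()
    .execute().actionGet();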

Can you provide us with the numbers you are comparing, so we can follow your
statement "Still a bit slower when compared to Solr DataImportHandler" in
more detail? Thanks.

Best regards,

Jörg
