OCC: Elasticsearch scalability issue when indexing documents with VersionType.EXTERNAL

Hi,

I am trying to do a bulk update of an Elasticsearch index from a MapReduce
job, using the TransportClient.

Everything works fine, and all documents get indexed, when I use
Elasticsearch's internal version control.

But I want to propagate the version from an external source (in my case
HBase; the MapReduce job indexes HBase columns). When I use the external
version, the indexing behavior becomes sporadic: not all documents get
indexed, and each run of the MapReduce job reports a different number of
indexed documents.

The code is below. The client is a TransportClient, and the version comes
from HBase and is held in the idx[] array (idx[1] in the code below).

If I comment out setVersion and setVersionType (i.e. use Elasticsearch's
internal versioning), everything works fine.

The code below runs inside my Reduce task. The Map task simply pulls all
the data from HBase and passes it to the Reduce task.

client.prepareIndex("index", "product", idx[0])
    .setVersion(Long.parseLong(idx[1]))
    .setVersionType(VersionType.EXTERNAL)
    .setOperationThreaded(false)
    .setSource(builder.string())
    .execute()
    .actionGet();

Please let me know if there is any issue with this approach.

Regards,
Dibyendu

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I forgot to mention: I am using Elasticsearch version 0.20.6.

Dib

In reply to the original post by Dibyendu Bhattacharya (Saturday, June 1, 2013 7:51:42 PM UTC-6).

There was an error in the code snippet in my earlier post.

To be more precise: if I use the code below (without the call to
actionGet()), the number of indexed documents varies from run to run of the
MapReduce job. There is no issue with the external version; I am sorry
about the earlier post. With just execute(), not all documents get indexed.

client.prepareIndex("index", "product", idx[0])
    .setVersion(Long.parseLong(idx[1]))
    .setVersionType(VersionType.EXTERNAL)
    .setOperationThreaded(false)
    .setSource(builder.string())
    .execute();

But if I add actionGet() as below, everything works fine and all documents
get indexed.

client.prepareIndex("index", "product", idx[0])
    .setVersion(Long.parseLong(idx[1]))
    .setVersionType(VersionType.EXTERNAL)
    .setOperationThreaded(false)
    .setSource(builder.string())
    .execute()
    .actionGet();

As I understand it, one is the synchronous and the other the asynchronous
way of indexing. Why is the asynchronous call failing in the MapReduce
code? Does it have anything to do with how the JobTracker starts and stops
the task JVM?

Dib
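
One possible explanation, offered as an assumption rather than a confirmed diagnosis: execute() returns immediately, so if the Reduce task finishes (and its JVM exits or the client is closed) while requests are still in flight, those documents may never reach the index. The JDK-only sketch below illustrates the general pattern of keeping the calls asynchronous while still waiting for all of them before the task ends. The names indexAsync, indexAllAndWait, and the thread pool are hypothetical stand-ins for the TransportClient call, not the Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncIndexSketch {
    // Hypothetical stand-in for the index: counts writes that actually completed.
    static final AtomicInteger indexed = new AtomicInteger();

    // Hypothetical stand-in for an asynchronous index call: it returns at once,
    // and the write itself finishes later on another thread.
    static Future<?> indexAsync(ExecutorService pool) {
        return pool.submit(() -> {
            try {
                Thread.sleep(1); // simulated network latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return; // an interrupted write is lost
            }
            indexed.incrementAndGet();
        });
    }

    // Submits n writes without blocking on each one, then waits for all of
    // them before returning, so nothing is still in flight when the caller exits.
    static int indexAllAndWait(int n) {
        indexed.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int i = 0; i < n; i++) {
                pending.add(indexAsync(pool));
            }
            for (Future<?> f : pending) {
                f.get(); // blocks until this write has completed
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return indexed.get();
    }

    public static void main(String[] args) {
        System.out.println(indexAllAndWait(100));
    }
}
```

In a Reduce task, the equivalent would be to collect the futures returned by execute() and wait on them before the task finishes (for example in the reducer's cleanup step), assuming the early JVM exit is in fact the cause.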
