OCC: Elasticsearch scalability issue when indexing documents with VersionType.EXTERNAL

Dibyendu_Bhattachary · June 2, 2013, 1:51am

Hi,

I am trying to do a bulk update of an Elasticsearch index using a MapReduce
job. I am using TransportClient.

Things work fine, and all documents get indexed, when I use Elasticsearch's
internal version control.

But I want to propagate the version from an external source (in my case
HBase; the MapReduce job indexes HBase columns). If I use the external
version, the indexing behavior becomes sporadic: not all documents get
indexed, and each run of the MapReduce job reports a different number of
indexed documents.

Below is the code. The client is a TransportClient, and I am getting the
version from HBase, which is in idx[1].

If I comment out setVersion and setVersionType (i.e. use Elasticsearch's
internal versioning), things work fine.

The code below executes within my Reduce task. The Map task basically pulls
all the data from HBase and passes it to the Reduce task.

client.prepareIndex("index", "product", idx[0])
    .setVersion(Long.parseLong(idx[1]))
    .setVersionType(VersionType.EXTERNAL)
    .setOperationThreaded(false)
    .setSource(builder.string())
    .execute()
    .actionGet();
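
For readers unfamiliar with external versioning, here is a minimal sketch of the acceptance rule as I understand it: with VersionType.EXTERNAL, a write is accepted only if the supplied version is strictly greater than the version already stored for that document; otherwise Elasticsearch reports a version conflict. The class and method names below (ExternalVersionSketch, index) are hypothetical, and this is a toy in-memory model, not Elasticsearch code.

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory model of the external version check; not Elasticsearch code.
final class ExternalVersionSketch {

    // Last accepted external version per document id.
    private final Map<String, Long> storedVersions = new HashMap<>();

    // Returns true if the write is accepted, false if it would conflict.
    boolean index(String id, long externalVersion) {
        Long current = storedVersions.get(id);
        if (current != null && externalVersion <= current) {
            // The real engine would raise a version conflict at this point.
            return false;
        }
        storedVersions.put(id, externalVersion);
        return true;
    }

    public static void main(String[] args) {
        ExternalVersionSketch store = new ExternalVersionSketch();
        System.out.println("first write, version 5: " + store.index("p1", 5L));
        System.out.println("same version again:     " + store.index("p1", 5L));
        System.out.println("newer version 6:        " + store.index("p1", 6L));
    }
}
```

Under this rule, re-running a job that sends the same HBase versions would have its writes rejected rather than re-applied. With a blocking call such a rejection surfaces as an exception; with a non-blocking call it stays inside the returned future unless something reads it.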

Do let me know if there is any issue with this approach.

Regards,
Dibyendu

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Dibyendu_Bhattachary · June 2, 2013, 1:52am

Forgot to mention: I am using Elasticsearch version 0.20.6.

Dib

Dibyendu_Bhattachary · June 2, 2013, 3:33am

There was an error in the code snippet in my earlier post.

To be more precise: if I use the code below (without the call to
actionGet()), the indexing gives sporadic counts in the MapReduce job.
There is no issue with the external version; I am sorry about the earlier
post. With just execute(), not all documents get indexed.

client.prepareIndex("index", "product", idx[0])
    .setVersion(Long.parseLong(idx[1]))
    .setVersionType(VersionType.EXTERNAL)
    .setOperationThreaded(false)
    .setSource(builder.string())
    .execute();

But if I add actionGet() as below, things work fine and all documents get
indexed.

client.prepareIndex("index", "product", idx[0])
    .setVersion(Long.parseLong(idx[1]))
    .setVersionType(VersionType.EXTERNAL)
    .setOperationThreaded(false)
    .setSource(builder.string())
    .execute()
    .actionGet();

As I understand it, one is the synchronous and the other the asynchronous
way of indexing. Why is the async call failing in the MapReduce code? Does
it have anything to do with how the JobTracker starts and stops the task
JVM?
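
One possible explanation, which is an assumption and not something confirmed in this thread: with execute() alone, nothing waits for the requests, so any that are still queued or in flight when the task finishes are lost when the task JVM goes away. The sketch below illustrates that effect using only the Java standard library. AsyncIndexingSketch, indexAsync, fireAndForget and waitForAll are hypothetical names, the sleep stands in for network latency, and shutdownNow() stands in for the JVM teardown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a reduce task; no Elasticsearch classes are used.
final class AsyncIndexingSketch {

    // Simulates one asynchronous index request. The counter records only
    // the requests that actually ran to completion.
    static CompletableFuture<Void> indexAsync(ExecutorService pool, AtomicInteger indexed) {
        return CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(50); // simulated network round trip
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return; // request abandoned: this document is never indexed
            }
            indexed.incrementAndGet();
        }, pool);
    }

    // Submits requests and tears down at once, like execute() alone
    // followed by an immediate task exit.
    static int fireAndForget(int docs) {
        AtomicInteger indexed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < docs; i++) {
            indexAsync(pool, indexed);
        }
        pool.shutdownNow(); // queued requests are dropped, running ones interrupted
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexed.get();
    }

    // Stays asynchronous, but waits for every pending request before returning.
    static int waitForAll(int docs) {
        AtomicInteger indexed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (int i = 0; i < docs; i++) {
            pending.add(indexAsync(pool, indexed));
        }
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        pool.shutdown();
        return indexed.get();
    }

    public static void main(String[] args) {
        System.out.println("fire and forget: " + fireAndForget(40) + " of 40");
        System.out.println("wait for all:    " + waitForAll(40) + " of 40");
    }
}
```

If this is indeed the cause, the equivalent in the real job would be to keep the futures returned by execute() and call actionGet() on each of them before the reducer finishes, for example in its cleanup step. That keeps the requests pipelined while still guaranteeing that every one has completed, and it also surfaces failures that a bare execute() would silently swallow.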

Dib
