Missing documents after bulk indexing

Hi,

I am using elasticsearch-0.90.0.jar and wonderdog
(https://github.com/infochimps-labs/wonderdog) for bulk indexing. There are
2417465 documents to index into Elasticsearch. However, after indexing, the
total number of documents that Elasticsearch reports for the query
{
  "query": {
    "match_all": {}
  }
}

is 2417463.

That is slightly fewer than the original 2417465 documents.

I ran the wonderdog bulk indexer on the same data again, hoping to insert
whatever was missed the first time, but that did not bring the count up to
2417465.

I am not sure whether this is normal behavior for Elasticsearch... Thanks!

Best

-Mao

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

If you don't see any failure in the bulk response (or in the node logs), you probably set the same id on more than one document, so 2 docs were updated instead of created.

Search for docs having _version > 1.
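
To illustrate why reused ids shrink the count: an index keyed by id keeps one live document per id, and each overwrite bumps that document's version. Below is a minimal in-memory sketch of that behavior. It is not Elasticsearch code; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory model of "index by id" semantics (illustrative only, not the
// Elasticsearch API): one live document per id, version bumped on overwrite.
final class VersionedIndexSketch {
    private final Map<String, Long> versions = new LinkedHashMap<>();

    // Returns the version after the write: 1 for a new id, >1 for an overwrite.
    long index(String id) {
        return versions.merge(id, 1L, Long::sum);
    }

    // What a match_all count would report: distinct ids only.
    int docCount() {
        return versions.size();
    }

    // Ids that were written more than once, in first-seen order.
    List<String> idsWithVersionAboveOne() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : versions.entrySet()) {
            if (e.getValue() > 1) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        VersionedIndexSketch index = new VersionedIndexSketch();
        String[] ids = {"a", "b", "c", "b", "d", "a"};
        for (String id : ids) {
            index.index(id);
        }
        System.out.println("actions sent: " + ids.length);
        System.out.println("docs counted: " + index.docCount());
        System.out.println("overwritten ids: " + index.idsWithVersionAboveOne());
    }
}
```

Six actions over four distinct ids leave four documents, which is the same kind of shortfall as in the original post, just at a smaller scale.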

HTH

David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Hi, David:

Thanks for the quick response. There are 3000 items (documents) in each
bulk batch. If a bulk request fails, are all 3000 items lost, or is the
bulk response marked as failed when any number of its items fail?

What usually causes a bulk index request to fail? Is there a way to
automatically retry a bulk request that is marked as failed?

Because of the small discrepancy, I ran the index job again with the same
dataset, and the count dropped even further, to 2417461. The original
number is 2417465, and the count after the first round of indexing was
2417463.

Best

-Mao

When one or more requests in a bulk fail, the other requests are still performed. The BulkResponse lists each individual failure.
A request can fail because of a malformed JSON document, or because you try to index a string into a numeric field, or something like that.

So, check your logs first.
Then, if you set ids yourself, check that you don't use the same id twice.
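
Since results come back per item, a loader can resend only the items that failed rather than the whole batch. Here is a rough sketch of that idea. `ItemResult` and `BulkSender` are hypothetical stand-ins for whatever client is in use, not Elasticsearch classes. Retrying only helps with transient failures; a malformed document will fail on every attempt and ends up in the returned list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BulkRetrySketch {
    // Hypothetical per-item outcome, standing in for one entry of a bulk response.
    static final class ItemResult {
        final String doc;
        final boolean failed;

        ItemResult(String doc, boolean failed) {
            this.doc = doc;
            this.failed = failed;
        }
    }

    // Hypothetical transport: sends a batch and returns one result per item.
    interface BulkSender {
        List<ItemResult> send(List<String> docs);
    }

    // Sends the batch, then resends only the failed items, for at most
    // maxAttempts rounds. Returns the docs that still failed at the end.
    static List<String> sendWithRetry(BulkSender sender, List<String> docs, int maxAttempts) {
        List<String> pending = new ArrayList<>(docs);
        for (int attempt = 1; attempt <= maxAttempts && !pending.isEmpty(); attempt++) {
            List<String> stillFailing = new ArrayList<>();
            for (ItemResult r : sender.send(pending)) {
                if (r.failed) {
                    stillFailing.add(r.doc);
                }
            }
            pending = stillFailing;
        }
        return pending;
    }

    public static void main(String[] args) {
        // Fake sender for the demo: rejects any doc containing "bad".
        BulkSender sender = batch -> {
            List<ItemResult> out = new ArrayList<>();
            for (String d : batch) {
                out.add(new ItemResult(d, d.contains("bad")));
            }
            return out;
        };
        List<String> lost = sendWithRetry(sender, Arrays.asList("doc1", "bad-doc", "doc2"), 3);
        System.out.println("permanently failed: " + lost);
    }
}
```

Logging the returned list, rather than aborting the whole job, makes it possible to see exactly which documents never made it into the index.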

HTH

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

I thought searching on the _version field was not possible.

You can look at the BulkResponse and see which documents were creates and
which ones were updates.

--
Ivan

In the very beginning when I first noticed this same condition, I was
puzzled too. For instance, my initial load statistics (carefully tracked by
my own Java-based bulk loader) were as follows:

Total=90467360 , create=0 , index=90467360 , delete=0

All index, no delete. And yet, Elasticsearch Head showed: docs: 76216778

So I did a scan, but ignored the source and just showed the metadata
(including the version number). A grep -v version=1 soon showed that many
documents had been indexed as many as 20 to 50 times under the same ID.
Very cool of Elasticsearch to smoothly update.
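
The gap can be reconciled from the version numbers themselves. With internally assigned versions and no deletes, each live document's version equals the number of times its id was written, so the versions sum to the total number of index actions. A small sketch of that check follows; the class name is arbitrary and the three-element sample is made up.

```java
final class LoadReconciliation {
    // Number of index actions that overwrote an existing id instead of
    // adding a new document.
    static long overwrites(long indexActions, long docCount) {
        return indexActions - docCount;
    }

    // Sum of versions over live documents. Under internal versioning with
    // no deletes this equals the number of index actions that were sent.
    static long actionsImpliedByVersions(long[] versions) {
        long sum = 0;
        for (long v : versions) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Figures quoted in the post above.
        System.out.println(overwrites(90467360L, 76216778L)); // 14250582

        // Made-up sample: three live docs, one of them written four times.
        long[] versions = {1, 4, 1};
        System.out.println(actionsImpliedByVersions(versions)); // 6
    }
}
```

If the version sum from a scan does not match the loader's own action count, something other than duplicate ids is going on.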

It also confirmed for me that I have to process my bulk-loads (index only)
and bulk-updates (index+delete) sequentially: The customer supplies the
records in order that they should be applied, and any multi-threading on my
part would violate the customer-specified ordering.

And no matter which version of ES I've used (0.19.4, 0.19.10, 0.20.4, and
now 0.90.0), I get the same record counts after both the initial load and
then followed by the updates when using the same set of data (to ensure a
repeatable data source for logic and performance testing).

Elasticsearch rocks!

Just to add some pseudo-ish code to what I wrote before, since the
BulkResponse does not explicitly mark a document as a create or an update.

BulkResponse bulkResponse = ...
int duplicates = 0;
for (BulkItemResponse bulkItemResponse : bulkResponse) {
    // A version above 1 means this id already existed and was overwritten.
    if (bulkItemResponse.version() > 1) duplicates++;
}

Each BulkItemResponse will have the version number that David/Brian are
referring to. I do some sanity-checking with code similar to the one I
provided above.

Since you are using wonderdog, you would have to change its source to add
the verification check above. Looking at the wonderdog source code, it
appears that they are not using Elasticsearch's newish BulkProcessor, so
the problem might be in their code (I didn't look too deeply at how they
handle end-of-stream closes).
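
One way a hand-rolled batcher can silently drop a few documents is by sending only full batches and never flushing the final partial one when the input ends. Whether wonderdog actually does this is not established here. The sketch below only illustrates the failure mode and the usual fix, using a hypothetical `Batcher` class that is not part of wonderdog or Elasticsearch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical batching buffer: hands documents to a sink in fixed-size
// batches and flushes the remainder when closed.
final class Batcher implements AutoCloseable {
    private final int batchSize;
    private final Consumer<List<String>> sink;
    private List<String> buffer = new ArrayList<>();

    Batcher(int batchSize, Consumer<List<String>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    void add(String doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered; does nothing when the buffer is empty.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(buffer);
        buffer = new ArrayList<>();
    }

    // Without this final flush, the last partial batch would never be sent.
    @Override
    public void close() {
        flush();
    }

    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        try (Batcher batcher = new Batcher(3000, received::addAll)) {
            for (int i = 0; i < 7465; i++) {
                batcher.add("doc-" + i);
            }
        }
        System.out.println("documents delivered: " + received.size());
    }
}
```

Using try-with-resources ties the final flush to the end of the input, so it cannot be forgotten on an early return or an exception.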

--
Ivan

Thanks, Ivan! I didn't realize I could do this! I am updating my bulk load
tool to track this statistic (updated documents with a version number
greater than one).

For my typical bulk load, this will indicate the number of duplicates.

But my bulk load tool also handles the _version meta data field. This is
especially valuable when I need to export an index, then delete that index
and reload it for whatever reason. In this case, I preserve the version
number; if _version is set to 2 or higher in an "index" or "create" action,
it is treated as an EXTERNAL version. This not only preserves the original
data and previously automatically-generated _id values, but also the
version numbers. Of course, in this case that statistic will tell me how
many version numbers were set to a value greater than 1 which gives me an
indication of how valuable preserving the version numbers turned out to be.
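
For context on the external version type: Elasticsearch accepts such a write only if the supplied version is higher than the one already stored, and it stores the supplied number as given. A minimal in-memory model of that rule is sketched below. The names are illustrative and this is not the Elasticsearch API.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of the external-version rule (illustrative only).
final class ExternalVersionSketch {
    private final Map<String, Long> stored = new HashMap<>();

    // Returns true if the write is accepted, false on a version conflict.
    boolean indexWithExternalVersion(String id, long version) {
        Long current = stored.get(id);
        if (current != null && version <= current) {
            return false; // supplied version is not newer than the stored one
        }
        stored.put(id, version);
        return true;
    }

    // Stored version for the id, or -1 if the id is unknown.
    long versionOf(String id) {
        return stored.getOrDefault(id, -1L);
    }

    public static void main(String[] args) {
        ExternalVersionSketch index = new ExternalVersionSketch();
        System.out.println(index.indexWithExternalVersion("doc-1", 7)); // accepted
        System.out.println(index.indexWithExternalVersion("doc-1", 7)); // conflict
        System.out.println(index.indexWithExternalVersion("doc-1", 9)); // accepted
        System.out.println(index.versionOf("doc-1"));
    }
}
```

One consequence is that replaying the same export into an index that already holds those versions is rejected item by item, so the reload has to target a fresh index, as described above.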

Mao Ye is using wonderdog; I'm not. But that's still good advice for the
future. :slight_smile:

Thanks again!!!

Brian

Hi, thanks so much for those useful comments and suggestions. Following the
advice:

  1. I changed the wonderdog jar to check the BulkResponse for every batch.
    The code looks like this:
    if (response.hasFailures())
    {
        throw new RuntimeException("BulkResponse shows failures: " + response.buildFailureMessage());
    }

    I added this code so that the job aborts if there is any failure in a
    bulk response.

  2. Before indexing, I deleted all existing indices.

  3. After starting the index job, I didn't catch any errors. However, the
    number of documents indexed is 1 less than the total I expect (note
    that I am sure there are no duplicates).

  4. After 30 minutes, the number of documents had even decreased by 2...
    I did nothing to the Elasticsearch cluster in the meantime.

Besides the steps above, there is some more information I should provide.

I have a 6-node cluster. While indexing, I set replica=0 to make indexing
faster. Is it possible that replica=0 causes this problem?

Best

-Mao

is it possible that replica=0 cause this problem?
No.

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs

I tried adding more replicas, but it does not help.

I have another confusing observation:

If I keep indexing the same dataset, the first round of indexing gives me
X1 documents indexed, but the second round gives me X2 documents indexed,
and X2 < X1.

In theory, X2 should equal X1... What could cause this? Thanks in advance!

Best

-Mao

I don't see any.
Perhaps turning logs to debug could help?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Hi, have you resolved your missing documents problem? I have the same problem. Thank you.