Missing documents after bulk indexing

Mao_Ye · May 16, 2013, 4:03am

Hi,

I am using elasticsearch-0.90.0.jar and wonderdog
(https://github.com/infochimps-labs/wonderdog) to do bulk indexing. There are
2417465 documents to index into Elasticsearch. However, I notice that
after the indexing, the total number of documents in Elasticsearch, counted
using
{
"query": {
"match_all": {}
}
}

is 2417463.

The number 2417463 is slightly smaller than the original number of
documents, 2417465.

I tried running the wonderdog bulk indexer on the data again, to insert
the documents that were missed the first time. It does not necessarily
get me to 2417465.

Not sure whether this is normal behavior for Elasticsearch... Thanks!

Best

-Mao

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

dadoonet · May 16, 2013, 4:59am

If you don't see any failure in the bulk response (or in the node logs), you probably set the same id for some documents, so 2 docs were updated instead of created.

Search for docs having _version > 1.
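
To illustrate why reused ids lower the count, here is a minimal sketch. It is a toy in-memory model, not the Elasticsearch API: the class `IdKeyedIndexModel` and its methods are made up for this example, and it only mimics the behavior described above (a second write to the same id replaces the document and bumps its version).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model (hypothetical, NOT the Elasticsearch client):
// writing an id that already exists overwrites it and increments its version.
class IdKeyedIndexModel {
    // id -> current version (1 on first write, +1 on every overwrite)
    private final Map<String, Long> versions = new LinkedHashMap<>();

    // Index one document id; returns the version after the write.
    long index(String id) {
        return versions.merge(id, 1L, Long::sum);
    }

    // Number of distinct ids, i.e. what a match_all count would report.
    int count() {
        return versions.size();
    }

    // Number of ids written more than once (version > 1).
    long overwritten() {
        return versions.values().stream().filter(v -> v > 1).count();
    }

    public static void main(String[] args) {
        IdKeyedIndexModel index = new IdKeyedIndexModel();
        List<String> ids = List.of("a", "b", "c", "b", "a");
        for (String id : ids) {
            index.index(id);
        }
        // prints: sent=5 count=3 overwritten=2
        System.out.println("sent=" + ids.size()
                + " count=" + index.count()
                + " overwritten=" + index.overwritten());
    }
}
```

Five index operations leave only three documents here, and the two reused ids are exactly the ones whose version is above 1.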

HTH

David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


Mao_Ye · May 16, 2013, 6:35am

Hi, David:

Thanks for the quick response. There are 3000 items (documents/records) in
one bulk batch. If a bulk fails, does it mean all 3000 items are lost? Or
will any number of failed items within one bulk mark a failure in the
bulk response?

What usually makes a bulk index request fail? Is there any way to
automatically retry the bulk if it is marked as failed?

Because I saw the small discrepancy, I ran the index job again (with the
same dataset) and noticed that the number dropped even further, to 2417461.
The original number is 2417465, and the number after the first round of
indexing was 2417463.

Best

-Mao


dadoonet · May 16, 2013, 6:42am

When one or more requests fail in a bulk, the other requests are still performed. You will see in the BulkResponse a list of the individual failures.
A request can fail because of a malformed JSON doc, or because you try to index a string into a number field, or something like that.
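
A rough sketch of acting on per-item failures rather than treating the whole bulk as lost. `ItemResult` and `BulkFailureTriage` are hypothetical stand-ins written for this example; they are not classes from the Elasticsearch client, which exposes comparable per-item information through BulkItemResponse.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one item of a bulk response (illustrative names only).
final class ItemResult {
    final String id;
    final String failureMessage; // null means the item succeeded

    ItemResult(String id, String failureMessage) {
        this.id = id;
        this.failureMessage = failureMessage;
    }

    boolean isFailed() {
        return failureMessage != null;
    }
}

class BulkFailureTriage {
    // Collect the ids of the failed items so only those are logged or retried.
    static List<String> failedIds(List<ItemResult> results) {
        List<String> failed = new ArrayList<>();
        for (ItemResult r : results) {
            if (r.isFailed()) {
                failed.add(r.id);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<ItemResult> results = List.of(
                new ItemResult("1", null),
                new ItemResult("2", "malformed JSON"),
                new ItemResult("3", null));
        System.out.println("failed ids: " + failedIds(results));
    }
}
```

Blindly retrying only helps with transient errors; a malformed document or a type mismatch would presumably fail the same way again, so those ids are better logged and fixed at the source.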

So, check your logs first.
Then, if you set the ids yourself, check that you don't reuse the same id twice.
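
One way to check for reused ids before indexing is sketched below, assuming the ids can be read from the input as plain strings. `DuplicateIdFinder` is a made-up helper for this example, not part of wonderdog or Elasticsearch, and for millions of ids it needs enough heap to hold them all.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-flight check: report every id that occurs more than once.
class DuplicateIdFinder {
    // Returns id -> number of occurrences, only for ids seen at least twice.
    static Map<String, Integer> findDuplicates(Iterable<String> ids) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : ids) {
            counts.merge(id, 1, Integer::sum);
        }
        // Drop the ids that appear exactly once.
        counts.values().removeIf(c -> c < 2);
        return counts;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("x", "y", "x", "z", "x", "y");
        System.out.println(findDuplicates(ids));
    }
}
```

If this map comes back empty for the source data, duplicate ids can be ruled out and the missing documents have to come from somewhere else.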

HTH

--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


Ivan · May 16, 2013, 3:26pm

I thought searching on the _version field was not possible.

You can look at the BulkResponse and see which documents were created and
which ones were updated.

--
Ivan



brian_yoder · May 16, 2013, 4:39pm

In the very beginning when I first noticed this same condition, I was
puzzled too. For instance, my initial load statistics (carefully tracked by
my own Java-based bulk loader) were as follows:

Total=90467360 , create=0 , index=90467360 , delete=0

All index, no delete. And yet, Elasticsearch Head showed: docs: 76216778

So I did a scan, but ignored the source and just showed the metadata
(including the version number). A grep -v version=1 soon showed that many
IDs had been reused as many as 20 to 50 times. Very cool of
Elasticsearch to smoothly update.

It also confirmed for me that I have to process my bulk-loads (index only)
and bulk-updates (index+delete) sequentially: The customer supplies the
records in order that they should be applied, and any multi-threading on my
part would violate the customer-specified ordering.

And no matter which version of ES I've used (0.19.4, 0.19.10, 0.20.4, and
now 0.90.0), I get the same record counts after both the initial load and
then followed by the updates when using the same set of data (to ensure a
repeatable data source for logic and performance testing).

Elasticsearch rocks!


Ivan · May 16, 2013, 5:01pm

Just to add some pseudo-ish code to what I wrote before, since the
BulkResponse does not explicitly mark a document as a create or an update.

BulkResponse bulkResponse = ...
int duplicates = 0;
// A version above 1 means the id already existed, so this item was an update.
for (BulkItemResponse bulkItemResponse : bulkResponse) {
    if (bulkItemResponse.version() > 1) duplicates++;
}

Each BulkItemResponse will have the version number that David and Brian
are referring to. I do some sanity checking with code similar to the
above.

Since you are using wonderdog, you would have to change its source to add
the verification check above. Looking at the wonderdog source code, it
appears that it does not use Elasticsearch's newish BulkProcessor, so
the problem might be in its code (I didn't look too deeply at how it
handles end-of-stream closes).

--
Ivan


brian_yoder · May 16, 2013, 6:24pm

Thanks, Ivan! I didn't realize I could do this! I am updating my bulk load
tool to track this statistic (updated documents with a version number
greater than one).

For my typical bulk load, this will indicate the number of duplicates.

But my bulk load tool also handles the _version meta data field. This is
especially valuable when I need to export an index, then delete that index
and reload it for whatever reason. In this case, I preserve the version
number; if _version is set to 2 or higher in an "index" or "create" action,
it is treated as an EXTERNAL version. This not only preserves the original
data and previously automatically-generated _id values, but also the
version numbers. Of course, in this case that statistic will tell me how
many version numbers were set to a value greater than 1 which gives me an
indication of how valuable preserving the version numbers turned out to be.
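
For readers unfamiliar with external versioning, here is a toy sketch of the rule as I understand it: a write that carries an external version is accepted only if that version is higher than the one already stored for the id. `ExternalVersionModel` is a hypothetical in-memory model for illustration, not my loader and not the Elasticsearch API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical) of an external-version check on writes.
class ExternalVersionModel {
    private final Map<String, Long> versions = new HashMap<>();

    // Accept the write only if the supplied version is strictly greater than
    // the stored one; a new id is always accepted in this simplified model.
    boolean index(String id, long externalVersion) {
        Long current = versions.get(id);
        if (current != null && externalVersion <= current) {
            return false; // would be reported as a version conflict
        }
        versions.put(id, externalVersion);
        return true;
    }

    // Stored version for the id, or null if the id was never written.
    Long versionOf(String id) {
        return versions.get(id);
    }

    public static void main(String[] args) {
        ExternalVersionModel index = new ExternalVersionModel();
        System.out.println(index.index("doc1", 5)); // accepted: new id
        System.out.println(index.index("doc1", 3)); // rejected: older version
        System.out.println(index.index("doc1", 6)); // accepted: newer version
    }
}
```

Reloading an exported index into a fresh, empty index therefore keeps each document's original version number, since every id is new at that point.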

Mao Ye is using wonderdog; I'm not. But that's still good advice for the
future.

Thanks again!!!

Brian


Mao_Ye · May 17, 2013, 4:09am

Hi, thanks so much for the useful comments and suggestions. Following the
advice, I changed the wonderdog jar to check the BulkResponse for every
batch. The code looks like this:

if (response.hasFailures()) {
    throw new RuntimeException("BulkResponse show failures: " + response.buildFailureMessage());
}

I added this code so that the job aborts if there is a failure in a bulk
response.

Before indexing, I deleted the existing index.
After I started the index job, I didn't catch any errors. However, I
noticed that the number of documents indexed is 1 less than the total I
expect (note that I am sure there are no duplicates).
After 30 minutes, I noticed that the number of documents had even
decreased by 2... I did nothing to the Elasticsearch cluster.

Beyond the steps above, here is some more information I should provide.

I have a 6-node cluster. While indexing, I set replica=0 to make the
indexing faster. Is it possible that replica=0 causes this problem?

Best

-Mao


dadoonet · May 17, 2013, 4:23am

"Is it possible that replica=0 causes this problem?"
No.

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs


Mao_Ye · May 17, 2013, 6:36am

I tried adding more replicas, but it does not help.

I have another confusing observation:

If I keep indexing the same dataset, the first round of indexing gives me
X1 documents indexed, but the second round gives me X2 documents indexed,
and X2 < X1.

Theoretically, X2 should be equal to X1... What could cause this? Thanks
in advance!

Best

-Mao


dadoonet · May 17, 2013, 6:40am

I don't see any.
Perhaps turning logs to debug could help?

--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


zhumeilian · October 11, 2014, 10:53am

Hi, have you resolved your missing documents problem? I have the same problem. Thank you.
