Missing documents after a bulk index


(Ivan Brusic) #1

Just finished bulk indexing 36 million documents to a single
node with 5 shards. However, there are only 30 million products in the
index. The node stats are:

"docs": {
"count": 30287500,
"deleted": 0
},
"indexing": {
"index_total": 38177500,
"index_time": "1.6d",
"index_time_in_millis": 146190895,
"index_current": 0,
"delete_total": 0,
"delete_time": "0s",
"delete_time_in_millis": 0,
"delete_current": 0
}

Why the large discrepancy between the expected count, the doc count,
and the index_total?

--
Ivan


(sujoysett) #2

I don't know your exact scenario or application, but I have run into this
problem twice before.

Once, when gathering documents from multiple sources into a single index,
there was an overlap of document ids, which led to some documents being
overwritten. Another time, some UTF-8 characters in the documents were
causing a few requests in every bulk to fail. Removing those UTF-8 chars
with a regex helped.
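The first failure mode is worth spelling out: indexing a document under an id that already exists is an overwrite, so the number of index operations can exceed the final document count without anything being "lost". A minimal language-neutral sketch (a plain dict standing in for an index keyed by `_id`; all names here are illustrative, not any ES API):

```python
def simulate_bulk_index(doc_ids):
    """Return (final_doc_count, index_total) for a stream of index requests."""
    index = {}          # id -> document, like an index keyed by _id
    index_total = 0     # every index operation counts, overwrites included
    for doc_id in doc_ids:
        index[doc_id] = {"id": doc_id}   # duplicate id silently overwrites
        index_total += 1
    return len(index), index_total

# Six requests, but one id collides: 5 docs remain, index_total is 6.
count, total = simulate_bulk_index([1, 2, 3, 4, 5, 3])
print(count, total)  # 5 6
```

This is exactly the shape of the discrepancy in the stats above: `index_total` larger than `docs.count`.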



(Ivan Brusic) #3

The document creation code has been running against Lucene for years, so I
do not think it is re-using doc ids (although Lucene does not have doc
ids in that sense).

My bulk indexer borrows heavily from existing ElasticSearch code:
https://gist.github.com/2577955

The log statements on lines 51 (throttling) and 78 (which keeps track
of duplicate ids) are never called, so it appears that all the
documents should have been indexed.

--
Ivan



(Shay Banon) #4

Is your class being called from multiple threads? Maybe that's the problem?
Also, you can mark the IndexRequest with the create flag, in which case it
will fail if the document already exists in the index.
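The create flag turns silent overwrites into visible failures, which makes duplicate ids easy to detect. A toy model of those semantics (a dict standing in for the index; this is a sketch of the behavior Shay describes, not the ES client API):

```python
def apply_op(index, doc_id, doc, op_type="index"):
    """Apply one operation; return True on success, False on a create conflict."""
    if op_type == "create" and doc_id in index:
        return False      # document already exists -> the request fails
    index[doc_id] = doc   # plain "index" always succeeds (insert or overwrite)
    return True

idx = {}
assert apply_op(idx, 1, {"v": "a"}, op_type="create")      # first create works
assert not apply_op(idx, 1, {"v": "b"}, op_type="create")  # duplicate id fails
assert apply_op(idx, 1, {"v": "b"})                        # plain index overwrites
```

Counting create failures during a rebuild would immediately reveal whether the source data is producing colliding ids.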



(Jörg Prante) #5

I perform multithreaded bulk writes with ~2000-3000 docs per second over
~90 minutes. Because such a data load causes high I/O load every 20-30
minutes or so, things need to be balanced out by setting a maximum limit
on simultaneous multithreaded requests for those peak situations.
Threads should wait until the ES indexer responds to outstanding bulk
requests. Maybe you can get some inspiration from my code here - I also
started from existing Elasticsearch code: https://gist.github.com/2578923

Jörg


(Ivan Brusic) #6

Yes, the code is multi-threaded. It is the same code that has created
Lucene documents successfully.

Part of the problem might be our expected counts. We are in the
development phase and are pointing to a non-active table, so our
expected count might be slightly off. In fact, it might be closer to
30M than 36M. However, I am still confused about the discrepancy
between docs.count and indexing.index_total. Now I don't remember if I
indexed test documents prior to the bulk index!

--
Ivan



(Ivan Brusic) #7

Jörg,

Your code is not waiting for an index request response, since you are
using an ActionListener. I switched to using execute().actionGet()
precisely because I was swamping ES with too much data.

--
Ivan



(Shay Banon) #8

Ivan, any update, did you manage to solve this? One more thing: adding
requests to the bulk request from multiple threads is problematic, since the
list there is not thread-safe. I really need to write a BulkProcessor that
allows multiple threads to add requests and allows for simple
throttling control (I can use that in several places myself).
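The BulkProcessor idea — many threads add requests, a shared buffer flushes at a threshold — reduces to a lock around the shared list. A rough language-neutral sketch (here `_flush_locked` just records batches; a real processor would send a bulk request, and this predates any actual BulkProcessor implementation):

```python
import threading

class BulkProcessor:
    def __init__(self, flush_size=1000):
        self.flush_size = flush_size
        self.lock = threading.Lock()      # the unprotected list is the hazard
        self.buffer = []
        self.flushed_batches = []

    def add(self, request):
        with self.lock:
            self.buffer.append(request)
            if len(self.buffer) >= self.flush_size:
                self._flush_locked()

    def _flush_locked(self):
        # Stand-in for the real bulk call; called with the lock held.
        self.flushed_batches.append(self.buffer)
        self.buffer = []

    def close(self):
        with self.lock:
            if self.buffer:
                self._flush_locked()

bp = BulkProcessor(flush_size=100)
threads = [threading.Thread(target=lambda: [bp.add(i) for i in range(500)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
bp.close()
print(sum(len(b) for b in bp.flushed_batches))  # 2000: no adds are lost
```

Without the lock, concurrent `append` calls and the flush check would race, which is exactly how requests go missing silently.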



(Jörg Prante) #9

This is intentional and ensures asynchronous multithreaded indexing. I
submit a number of requests in parallel from a lot of threads, and I do not
wait for the ES bulk response. The number of asynchronous submits is bounded
(e.g. 30 open bulk requests), so I use a lot of resources, but I do
not swamp ES with data. The ActionListener is invoked later by other
response threads.

If you switch to execute().actionGet(), you just choose to perform bulk in
synchronous mode, i.e. 1 open bulk request at a time.

Jörg



(Ivan Brusic) #10

As mentioned before, there might have been an issue with our expected
count, so the results might be correct. Still confused about the
difference between the doc count and the index_total. Will probably
attempt another full index rebuild later today or tomorrow.

Each writer thread in the system has its own bulk indexer, which
solves the thread-safety issue. Synchronous calls work well for us since
they allow us to collect detailed metrics in one place.

--
Ivan
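The per-thread arrangement Ivan describes sidesteps the shared-list problem entirely: each writer thread owns a private buffer, so no lock is needed on the hot path. A sketch of that shape using thread-local storage (the flush "bulk call" here just records batches; all names are illustrative):

```python
import threading

tls = threading.local()          # each thread sees its own .buffer
flushed = []
flush_lock = threading.Lock()    # only taken at flush time, not per add

def add(request, flush_size=100):
    buf = getattr(tls, "buffer", None)
    if buf is None:
        buf = tls.buffer = []    # lazily create one private buffer per thread
    buf.append(request)          # no lock: only this thread touches buf
    if len(buf) >= flush_size:
        with flush_lock:         # stand-in for the synchronous bulk call
            flushed.append(list(buf))
        buf.clear()

def worker(n):
    for i in range(n):
        add(i)

threads = [threading.Thread(target=worker, args=(300,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(len(b) for b in flushed))  # 1200
```

The synchronous flush is also a natural single point to time each bulk round-trip and collect the kind of per-thread metrics Ivan mentions.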



(Shay Banon) #11

I see. So, just so I am clear, Ivan: each thread has its own instance of the
bulk indexer you posted?



(Ivan Brusic) #12

Correct. I originally wrote that bulk indexer code when I was running
a single-threaded indexer (river), so I made minimal changes to
support our existing multi-threaded Lucene indexing.

This ElasticSearch project has been put on hold for a while, but is
finally starting up again. Most of the code is proof-of-concept at
this stage and will get firmed up as time goes by. Still need to build
another complete index.

--
Ivan



(Santiago Trías) #13

I ran into lost documents when trying to do bulk requests on my local
server.
I was doing 1000 per request and was losing around 80% of the documents.
Changing to 10 solved it.
Any other solution to this? I have to load 11 million documents, and even
with multi-threading it is kind of slow doing it 10 at a time.

Thanks.
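Dropping to 10 docs per request masks the symptom rather than fixing it: when a server sheds load, individual items in a bulk response come back as failures, and those need to be retried rather than silently dropped. A sketch of that chunk-and-retry loop, with `make_flaky_bulk` standing in for a server that rejects some items on the first attempt (all names and the failure model are illustrative):

```python
def chunks(docs, size):
    """Split a document list into batches of at most `size`."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def index_all(docs, bulk_fn, batch_size=500, max_retries=5):
    """Send every batch, re-sending only the items the server rejected."""
    indexed = 0
    for batch in chunks(docs, batch_size):
        pending = batch
        for _ in range(max_retries):
            failures = bulk_fn(pending)           # per-item failures returned
            indexed += len(pending) - len(failures)
            if not failures:
                break
            pending = failures                    # retry only the rejected items
    return indexed

def make_flaky_bulk():
    """Fake server: reject every 10th doc the first time, accept it on retry."""
    seen = set()
    def bulk_fn(batch):
        failures = [d for d in batch if d % 10 == 0 and d not in seen]
        seen.update(batch)
        return failures
    return bulk_fn

docs = list(range(2000))
print(index_all(docs, make_flaky_bulk()))  # 2000: nothing is silently lost
```

With per-item failure handling in place, the batch size can stay at a few hundred or more, which is far faster than 10 at a time.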



