Bulk indexing and count mismatch


(phoenix) #1

Hi all,

I'm facing a strange situation.
I'm indexing some documents using the bulk Java API.
My problem is in my tests. First my test uses the bulk API to index 100
randomly created documents (with randomly generated strings). Then it
executes a count query to check that I now have 100 documents in my index.
But I actually have 0, or maybe one. And sometimes, while debugging, I
actually have 100 docs.
Is this a synchronization problem? I thought that the actionGet() method
waited for the job to be done before returning.

My code is as follows:

BulkResponse response = bulkRequest.execute().actionGet();

Any ideas?

Frederic


(Ivan Brusic) #2

Are you indexing to a new id for each document? Reusing the same id might
account for seeing only one document, but zero documents is a different issue.
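
To illustrate the id question, here is a minimal sketch of a toy index keyed by document id. It is not the Elasticsearch API: `ToyIdIndex` and its methods are hypothetical names, and `UUID.randomUUID()` is only a stand-in for whatever id the server would generate.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Toy model only, not the Elasticsearch API: an index keyed by document id,
// where writing to an existing id replaces the stored document.
class ToyIdIndex {
    private final Map<String, String> docsById = new HashMap<>();

    // Index with an explicit id; an existing document with that id is replaced.
    void index(String id, String source) {
        docsById.put(id, source);
    }

    // Index without an id; a random UUID stands in for an auto-generated id.
    void index(String source) {
        index(UUID.randomUUID().toString(), source);
    }

    int count() {
        return docsById.size();
    }

    public static void main(String[] args) {
        ToyIdIndex sameId = new ToyIdIndex();
        ToyIdIndex generatedIds = new ToyIdIndex();
        for (int i = 0; i < 100; i++) {
            sameId.index("1", "doc-" + i);   // always the same id
            generatedIds.index("doc-" + i);  // a fresh id each time
        }
        System.out.println("same id: " + sameId.count());
        System.out.println("generated ids: " + generatedIds.count());
    }
}
```

In this toy model, 100 writes to one id leave a single document, while 100 writes with fresh ids leave 100.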

--
Ivan


(phoenix) #3

We are not giving any id to our documents. According to the API,
ElasticSearch is supposed to generate one by itself.
And actually, if we add a Thread.sleep(...) before asking for the document
count, we get the expected result. So it seems it just takes time.
But I thought (and reading the javadoc seems to confirm it) that the
execute().actionGet() call waits for the task to complete before
returning.
Is it different for bulk requests?
Is there any way to actually wait for the indexing to be done? (It's not
really important at runtime, but it is for testing purposes.)
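
For tests, one alternative to a fixed Thread.sleep(...) is to poll until a condition holds or a timeout expires. The following is a minimal sketch using only the JDK. `Poll` and `waitUntil` are hypothetical names, not part of any Elasticsearch API, and the counter in `main` is just a stand-in for re-running the count query.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical test helper (not part of any Elasticsearch API): polls a
// condition until it holds or the timeout expires.
class Poll {
    // Returns true as soon as the condition holds, false on timeout or interrupt.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "run the count query again": becomes true on the third check.
        AtomicInteger checks = new AtomicInteger();
        boolean ok = waitUntil(() -> checks.incrementAndGet() >= 3, 2_000, 10);
        System.out.println("condition met: " + ok + " after " + checks.get() + " checks");
    }
}
```

In a real test the supplier would compare the count returned by the index with the expected number of documents. The timeout keeps a genuinely failing test from hanging.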

Frederic


(Ivan Brusic) #4

The synchronous call ensures that the operation is committed at the
server level, but there can still be delays at the Lucene level: newly
indexed documents only become visible to search after a refresh. The
default index refresh interval is 1 second. How many BulkItemResponses
do you have in your BulkResponse?
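
As a rough illustration of why a count taken right after a synchronous index call can still be stale, here is a toy model. It is not the Elasticsearch or Lucene API: `ToyNearRealTimeIndex` and its methods are hypothetical, and a real index refreshes on a timer rather than only on demand.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is NOT the Elasticsearch API. It mimics the idea that
// indexed documents sit in a buffer until a refresh makes them searchable.
class ToyNearRealTimeIndex {
    private final List<String> buffer = new ArrayList<>();      // accepted, not yet searchable
    private final List<String> searchable = new ArrayList<>();  // visible to count/search

    // Returns once the document is accepted, like a synchronous index call.
    void index(String doc) {
        buffer.add(doc);
    }

    // Moves buffered documents into the searchable view.
    void refresh() {
        searchable.addAll(buffer);
        buffer.clear();
    }

    // Counts only what has been made searchable.
    int count() {
        return searchable.size();
    }

    public static void main(String[] args) {
        ToyNearRealTimeIndex index = new ToyNearRealTimeIndex();
        for (int i = 0; i < 100; i++) {
            index.index("doc-" + i);
        }
        System.out.println("before refresh: " + index.count());
        index.refresh();
        System.out.println("after refresh: " + index.count());
    }
}
```

In this model the count is 0 before the refresh and 100 after it, even though every index call had already returned.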

--
Ivan


(Igor Motov) #5

Frederic,

Just add an explicit refresh before checking the document count:

client.admin().indices().prepareRefresh().execute().actionGet();

This command will ensure that all indexed records are refreshed and
available in your searches.


(phoenix) #6

@Ivan:
We have 1 or 100, depending on whether we sleep on the thread.

@Igor:
Thanks for the tip. We'll try refreshing first and let you know the result :)

Frederic


(phoenix) #7

Thanks Igor, it works perfectly!

