Java client: typeExists() returns false after successful bulk index - why?


(Nikita Tovstoles) #1

ES newbie here. I noticed that typeExists() queries return false after a
successful bulk index(), but I don't understand why. Is that expected? I'm
using Java client v0.90.9. Thanks in advance!

non-bulk works:

    @Test
    public void testTypeExists()
    {
        // The type must not exist before the first document is indexed.
        assertFalse(admin().indices().prepareTypesExists(INDEX_NAME).setTypes("foo").get().isExists());

        // Index a single document (non-bulk).
        assertEquals("foo",
            client().prepareIndex().setIndex(INDEX_NAME).setType("foo").setId("1")
                .setSource("{\"a\":\"b\"}").get().getType());

        // RETURNS TRUE AS EXPECTED
        assertTrue(admin().indices().prepareTypesExists(INDEX_NAME).setTypes("foo").get().isExists());
    }

bulk fails:

    @Test
    public void testTypeExistsAfterBulkIndex()
    {
        // The type must not exist before the first document is indexed.
        assertFalse(admin().indices().prepareTypesExists(INDEX_NAME).setTypes("foo").get().isExists());

        // Index the same document through a one-item bulk request.
        // SUCCEEDS
        assertEquals("foo",
            client().prepareBulk()
                .add(client().prepareIndex().setIndex(INDEX_NAME).setType("foo").setId("1")
                    .setSource("{\"a\":\"b\"}"))
                .execute().actionGet().getItems()[0].getType());

        // FAILS
        assertTrue(admin().indices().prepareTypesExists(INDEX_NAME).setTypes("foo").get().isExists());
    }

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/10d8a7c9-7d4b-4065-a8e5-624fd2750393%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Brian Yoder) #2

A quick guess: The first one works because the first document for that type
is indexed and therefore the type is created when the operation returns.

But the second one doesn't work because there is a refresh interval between
the completion of a bulk load operation and the actual document being
added. And since it's the first document in the type, the type won't exist
until that first document is indexed. Which is likely exactly what you
want: Bulk operations need to defer until they are processed to allow for
optimizations. I don't know Lucene internals, but a B+Tree loads vastly
quicker when keys are presorted in bulk instead of added and committed one
by one.

The experts can chime in later, and if I'm wrong or off base anywhere I
welcome the correction!

Brian



(Nikita Tovstoles) #3

Yes! Adding a sleep after Future.get() of the bulk op 'fixed' my test -
thank you.

What you said about the bulk op being submitted but not yet processed makes
sense (perhaps there is a separate API to query for an op's completion
status?), but what puzzles me is that the comments in the source of
BulkResponse seem to imply it is constructed after the op completes:

    Holding a response for each item responding (in order) of the bulk
    requests. Each item holds the index/type/id is operated on, and if it
    failed or not (with the failure message).

...thus I was expecting that by the time ListenableActionFuture.get()
returns, the op is actually completed (not just submitted). Otherwise the
status properties in the embedded BulkItemResponse would not be useful,
right?



(Brian Yoder) #4

Bulk requests are processed together, so my only guess is that it returns
when it knows enough about whether the operation on each document succeeded
or failed. But there may still be a little more work before it finally
makes it to the database.

Instead of the sleep, try an index refresh. Maybe that will make the test
case more deterministic.
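
Another way to avoid a fixed sleep in a test is to poll until the condition
holds or a timeout expires. The sketch below is plain Java with no
Elasticsearch dependency. The names AwaitCondition and awaitTrue are
hypothetical, and the lambda in main is only a stand-in for the
prepareTypesExists(...) check from the original test; whether a refresh
alone is enough for that check is not something this sketch settles.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

class AwaitCondition {

    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition became true in time, false on timeout
     * or if the waiting thread was interrupted.
     */
    public static boolean awaitTrue(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the cluster: the "type" becomes visible about 200 ms from now.
        // In the real test this would be the typesExists check from the first post.
        long visibleAt = System.currentTimeMillis() + 200;
        boolean seen = awaitTrue(() -> System.currentTimeMillis() >= visibleAt, 5000, 20);
        System.out.println("type visible: " + seen);
    }
}
```

In a test, the assertion then becomes assertTrue(awaitTrue(..., timeout,
poll)), which fails only if the condition never holds within the timeout
instead of depending on a guessed sleep length.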

Also note that even in the first case, there will be a slight delay between
the return of the successful update and the ability to query on one or more
of the indexed fields. Making a document searchable takes a little time
(configurable; the default refresh interval is 1s), but the index operation
returns as soon as it knows all it needs to know about whether the document
can be successfully updated. However, a get-by-id operation can be done
immediately after an update; only the indexing of the fields as reflected
in the on-disk Lucene shard is not quite realtime and synchronous to the
update request.
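
To make the distinction concrete, here is a toy in-memory model of the
behaviour described above: get-by-id sees a write immediately, while search
only sees it after a refresh. This is an illustration only, not how
Elasticsearch or Lucene are implemented; the class ToyNearRealTimeStore and
all of its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToyNearRealTimeStore {
    // Every accepted write lands here immediately; get-by-id reads from it.
    private final Map<String, String> latestById = new HashMap<>();
    // Snapshot that searches see; it only changes when refresh() is called.
    private Map<String, String> searchable = new HashMap<>();

    /** Accepts a document; it is retrievable by id right away. */
    public void index(String id, String source) {
        latestById.put(id, source);
    }

    /** Realtime lookup by id. */
    public String get(String id) {
        return latestById.get(id);
    }

    /** Naive substring search over the last refreshed snapshot only. */
    public List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : searchable.entrySet()) {
            if (e.getValue().contains(term)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    /** Publishes all writes accepted so far to the searchable snapshot. */
    public void refresh() {
        searchable = new HashMap<>(latestById);
    }

    public static void main(String[] args) {
        ToyNearRealTimeStore store = new ToyNearRealTimeStore();
        store.index("1", "{\"a\":\"b\"}");
        System.out.println("get sees doc: " + (store.get("1") != null));
        System.out.println("search before refresh: " + store.search("b"));
        store.refresh();
        System.out.println("search after refresh: " + store.search("b"));
    }
}
```

In the toy model, calling refresh() by hand plays the role of the explicit
index refresh suggested above; a periodic call would play the role of the
refresh interval.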

Also see:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html#index-refresh

But in general: Whenever I index (create or update) a document, I leave the
recommended defaults in force, and only depend on a get-by-id to succeed
immediately after. And for bulk loading, I'm happy enough that 97 million
documents (not too complex, but still very useful) can be bulk loaded and
reloaded on my laptop in under 2.5 hours, and that chunks of "daily"
updates of 1 million updates and 1 million deletes can be bulk loaded in
10 or 15 minutes.

And I find that ElasticSearch provides a very good balance between optimal
update performance and quick search availability of those updates.

Also, my experience with your particular test case is very limited. I've
recently locked down ElasticSearch and all my mappings so that neither the
index nor the type is automatically created nor are unknown fields able to
be indexed. I must explicitly create an index and load all of the mappings
for all of the fields in all of the types before any documents are indexed.
But I digress...

Hope this helps!

Brian


