Clarification about real time get with index updates and shard replcation

Firstly my understanding of what real time get means: A newly indexed
document or re-indexed existing document can be fetched using its ID
immediately after it has been indexed,
and therefore exists in the shards Lucene IndexWriter memory cache,
but before the shards Lucene IndexReader has been refreshed from the
IndexWriter, which for near real time search involves flushing the
IndexWriter to a new Lucene segment.

So, does this mean that ElasticSearch has a cache of recently indexed
documents that it first consults for get operations before initiating
a search of the index itself, or does it consult the IndexWriter
memory cache?
If a document has not been recently indexed does this automatically
cause an index search?
If there is a cache, are all gets cached for some time so that future
gets for the same ID do not incur an index search?
If there is a cache, when are its contents cleared?
Is there any locking to prevent multiple threads from concurrently
indexing different versions of a document with a given ID to ensure
that new version numbers are allocated based on when the index
operation starts and that there is no duplication of version numbers?

Now about shard replication: The default option is synchronous
replication whereby the index operation at the client returns after
all replicas have indexed the document
but the IndexReaders of the shards may not yet have been refreshed.
With the second option, asynchronous shard replication, the index
operation at the client returns after the primary shard has completed
the index operation with no guarantee that the replicas have completed
the index operation.

Is the real time get operation guaranteed to be consistent across all
replicas, ie returning the same newly indexed document, after the
index operation returns at the client with
a) synchronous shard replication?
b) asynchronous shard replication?

Regards
Mauri

On Wed, Oct 19, 2011 at 5:45 AM, Mauri mauri@proactive-edge.com.au wrote:

Firstly my understanding of what real time get means: A newly indexed
document or re-indexed existing document can be fetched using its ID
immediately after it has been indexed,
and therefore exists in the shards Lucene IndexWriter memory cache,
but before the shards Lucene IndexReader has been refreshed from the
IndexWriter, which for near real time search involves flushing the
IndexWriter to a new Lucene segment.

So, does this mean that Elasticsearch has a cache of recently indexed
documents that it first consults for get operations before initiating
a search of the index itself, or does it consult the IndexWriter
memory cache?

Its not a cache, it uses the transaction log to fetch the document when
using realtime get.

If a document has not been recently indexed does this automatically
cause an index search?

Yes, it will do a search to find the doc if its not in the transaction log
anymore.

If there is a cache, are all gets cached for some time so that future
gets for the same ID do not incur an index search?
If there is a cache, when are its contents cleared?

Its tied into the transaction log logic, so when the transaction log gets
cleared (called flush in elasticsearch), realtime get for those docs will
now go and do a "search".

Is there any locking to prevent multiple threads from concurrently
indexing different versions of a document with a given ID to ensure
that new version numbers are allocated based on when the index
operation starts and that there is no duplication of version numbers?

Yes.

Now about shard replication: The default option is synchronous
replication whereby the index operation at the client returns after
all replicas have indexed the document
but the IndexReaders of the shards may not yet have been refreshed.
With the second option, asynchronous shard replication, the index
operation at the client returns after the primary shard has completed
the index operation with no guarantee that the replicas have completed
the index operation.

Is the real time get operation guaranteed to be consistent across all
replicas, ie returning the same newly indexed document, after the
index operation returns at the client with
a) synchronous shard replication?
b) asynchronous shard replication?

A get operation will go to either the primary shard or the replica shard.
With async replication, you might get a stale version of the doc. You can
set a preference flag in the get API to say that it will execute only on the
primary shard.

Regards
Mauri

Thanks Shay

A few more points for clarification:

On indexing

  1. Does each indexing operation incur a prior index search to check
    for existing versions of a document followed by a delete if one is
    found?
  2. If so, is it possible to disable this if an external mechanism can
    guarantee that there will not be a clash of ID values, eg for a bulk
    import from another system that collects data and allocates unique
    document IDs?
  3. What is the impact if there is duplication of ID numbers, eg if the
    mechanism in 2 fails?

On cluster management with large number of indexes (eg index per user,
index per month, etc)
4. Do all servers in a cluster maintain synchronized maps containing
details of all other servers in the cluster, the indexes in the
cluster and all shards/replicas of all indexes?
5. In the situation where there are many unrelated indexes, eg per
user with 100's of users (hypothetically), is there a point where it
is better to have several ES clusters rather than one large one, eg 4
x 5 server clusters rather than 1 x 20 server cluster.
In particular, I am wondering if there is a point where the
overheads associated with maintaining cluster configuration data,
cluster management messaging, etc become excessive and it is better to
split out to multiple clusters?

Regards
Mauri