First, my understanding of what real-time get means: a newly indexed
document, or a re-indexed existing document, can be fetched by its ID
immediately after it has been indexed. At that point it exists in the
shard's Lucene IndexWriter memory cache, but the shard's Lucene
IndexReader has not yet been refreshed from the IndexWriter, which for
near-real-time search involves flushing the IndexWriter to a new
Lucene segment.
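
To make the scenario concrete, here is a toy Python model of the behaviour I am describing. It is only a sketch of my understanding: ToyShard and all of its method names are hypothetical, and nothing here is taken from the Elasticsearch or Lucene source.

```python
# Toy model only: names are hypothetical and do not mirror
# Elasticsearch or Lucene internals.
class ToyShard:
    def __init__(self):
        self.unrefreshed = {}  # documents indexed since the last refresh
        self.searchable = {}   # documents visible to the refreshed reader

    def index(self, doc_id, source):
        # A new or re-indexed document lands in the in-memory buffer first.
        self.unrefreshed[doc_id] = source

    def refresh(self):
        # A refresh exposes buffered documents to search and empties the buffer.
        self.searchable.update(self.unrefreshed)
        self.unrefreshed.clear()

    def search_by_id(self, doc_id):
        # Near-real-time search sees only what the last refresh exposed.
        return self.searchable.get(doc_id)

    def realtime_get(self, doc_id):
        # Real-time get (as I understand it): check the unrefreshed
        # buffer first, then fall back to the searchable index.
        if doc_id in self.unrefreshed:
            return self.unrefreshed[doc_id]
        return self.searchable.get(doc_id)
```

In this model, calling index() and then realtime_get() returns the new document straight away, while search_by_id() keeps returning the old state until refresh() is called. Whether the real lookup goes through a separate cache or through the IndexWriter itself is exactly what I am asking below.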
So, does this mean that Elasticsearch has a cache of recently indexed
documents that it first consults for get operations before initiating
a search of the index itself, or does it consult the IndexWriter
memory cache?
If a document has not been recently indexed, does a get for it
automatically cause an index search?
If there is a cache, are all gets cached for some time so that future
gets for the same ID do not incur an index search?
If there is a cache, when are its contents cleared?
Is there any locking to prevent multiple threads from concurrently
indexing different versions of a document with a given ID, so that
new version numbers are allocated in the order in which the index
operations start and no version number is ever duplicated?
Now, about shard replication: the default option is synchronous
replication, whereby the index operation at the client returns after
all replicas have indexed the document,
although the IndexReaders of the shards may not yet have been refreshed.
With the second option, asynchronous shard replication, the index
operation at the client returns after the primary shard has completed
the index operation with no guarantee that the replicas have completed
the index operation.
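
The difference between the two modes, as I understand it, can be sketched with another toy Python model. ToyReplicaGroup, its methods and the "sync"/"async" strings are all hypothetical names chosen for illustration. The model makes the asynchronous case deterministic by queueing replica writes until apply_pending() is called, and it says nothing about which shard copy a real get would be routed to.

```python
from collections import deque


# Toy model only: names and behaviour are assumptions for illustration,
# not Elasticsearch's actual replication code.
class ToyReplicaGroup:
    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = deque()  # replica writes not yet applied

    def index(self, doc_id, source, replication="sync"):
        self.primary[doc_id] = source
        self.pending.append((doc_id, source))
        if replication == "sync":
            # Synchronous: every replica has the document before we return.
            self.apply_pending()
        # Asynchronous: return now; replicas catch up at some later time.

    def apply_pending(self):
        # Apply queued writes to every replica in arrival order.
        while self.pending:
            doc_id, source = self.pending.popleft()
            for replica in self.replicas:
                replica[doc_id] = source

    def get_from_every_copy(self, doc_id):
        # What a get by ID would return from the primary and each replica.
        copies = [self.primary] + self.replicas
        return [copy.get(doc_id) for copy in copies]
```

After a "sync" index call, get_from_every_copy() returns the same document from every copy. After an "async" call, the replicas return None until apply_pending() runs, which is the window my question b) is about.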
Is the real-time get operation guaranteed to be consistent across all
replicas, i.e. to return the same newly indexed document from each,
once the index operation has returned at the client with
a) synchronous shard replication?
b) asynchronous shard replication?
Regards
Mauri