Elasticsearch Record Availability Time

Hi there,

We have some questions about how soon a record becomes available for search once it has been put on the cluster.

  1. We want to know how soon an indexed record is available for search, and which factors determine this availability time.
  2. We have also observed that a record is available even earlier than the refresh interval. It would be great to have an explanation for this.
  3. Also, is there a way to prove this availability time under fixed load on the cluster (e.g. a test case or something similar that we can simulate)?

Hoping to improve our knowledge on this.

Do you mean available through the document GET API? If so, you have discovered the near-real-time feature of document retrieval.



Yes, I meant the document GET API. How different would it be if we used the search API instead?

We have explored the near-real-time search feature and were assuming that a record becomes searchable only once it has been put in a new segment and that segment is opened for search, which I believe happens on a refresh.

But the document GET API returns the document much earlier than the refresh interval.
(We got a GET response with found = true within 93 milliseconds, whereas the refresh interval was the default 1 second.)

Pardon me, but I could not see how versioning in Elasticsearch is related to the time by which a record is available for search.

The background is that documents get into a queue before being indexed.

A document will be versioned first. At that time, it is placed into a RAM buffer, which is a tiny RAM-only Lucene segment. Once Lucene has received the document, it can offer rudimentary retrieval operations by doc ID. Elasticsearch simply reuses this Lucene near-real-time feature and exposes it as an Elasticsearch feature that works across all shards of a distributed index.

Search comes later. I will skip the "translog" story (it can be seen as a write-ahead log file). From the RAM buffer, the document is analyzed, tokenized, filtered and so on to generate a token stream, which finally enters a dictionary data structure on disk. This inverted index is synced to disk from time to time, making the document searchable. Each sync step produces a segment. Because this is an I/O-expensive operation, Elasticsearch flushes this buffer only every 1s (the default refresh interval), in order to batch as many docs as possible into a new Lucene segment. So you cannot search for terms in the document instantly, but only after a delay, which saves a lot of expensive I/O.
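To make the difference between "retrievable by ID" and "searchable by term" concrete, here is a toy model in Python. It is only a sketch of the visibility rules described above, not Elasticsearch or Lucene internals; the class `ToyShard` and all of its method names are made up for illustration, and it ignores updates, deletes and the translog.

```python
class ToyShard:
    """Toy model of one shard: buffered docs are gettable, only refreshed docs are searchable."""

    def __init__(self):
        self.buffer = {}    # doc_id -> text, accepted but not yet in a segment
        self.stored = {}    # doc_id -> text, for docs already written to a segment
        self.segments = []  # each segment is an inverted index: term -> set of doc ids

    def index(self, doc_id, text):
        # The document only lands in the in-memory buffer here.
        self.buffer[doc_id] = text

    def get(self, doc_id):
        # Lookup by ID sees the buffer, so it works before any refresh.
        if doc_id in self.buffer:
            return self.buffer[doc_id]
        return self.stored.get(doc_id)

    def refresh(self):
        # Turn everything buffered so far into one new searchable segment.
        if not self.buffer:
            return
        segment = {}
        for doc_id, text in self.buffer.items():
            for term in text.lower().split():
                segment.setdefault(term, set()).add(doc_id)
            self.stored[doc_id] = text
        self.segments.append(segment)
        self.buffer.clear()

    def search(self, term):
        # Term search only looks at segments, never at the buffer.
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term.lower(), set())
        return hits


shard = ToyShard()
shard.index("1", "hello world")
print(shard.get("1"))          # the document is returned right away
print(shard.search("hello"))   # no hits before the refresh
shard.refresh()
print(shard.search("hello"))   # the document is found after the refresh
```

In this model the 1s delay would simply be a timer that calls `refresh()` periodically, which is why a GET can succeed long before a term query does.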

There is a manual operation called _refresh which combines the necessary steps: it clears the RAM buffers, writes all outstanding data to disk, and switches the Lucene state to the latest segments. This operation should not be executed manually too often, because it interferes with the automatic refreshing and only adds extra load.
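For reference, a manual refresh is a `POST` to `/<index>/_refresh`, and the automatic interval is the `index.refresh_interval` setting, changed with a `PUT` to `/<index>/_settings`. The sketch below only builds these two HTTP requests with the Python standard library and does not send anything. The base URL and the index name `my-index` are assumptions for illustration, and the helper function names are made up.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default HTTP port
INDEX = "my-index"                  # hypothetical index name


def refresh_request(base_url=BASE_URL, index=INDEX):
    """Build (but do not send) a manual refresh request for one index."""
    return Request(f"{base_url}/{index}/_refresh", method="POST")


def refresh_interval_request(interval, base_url=BASE_URL, index=INDEX):
    """Build (but do not send) a request that changes index.refresh_interval."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return Request(
        f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = refresh_interval_request("5s")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

To actually send one of these, pass it to `urllib.request.urlopen`. For a measurement run it is probably better to leave the interval alone and avoid manual refreshes, since both change exactly the delay being measured.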

Hey, thanks for the detailed info. :smile:

It would be great if you could point me to a source for this info (an ES whitepaper or documentation) that explains what happens between the time the cluster accepts the record and queues it and the time the segment is written to the file system cache and opened for search.

And just a clarification: I got the intent of what you said when describing the refresh API, but I think you are confusing the refresh and flush APIs, or maybe you are happy to keep me at that level of abstraction. :smile:

By the way, I agree with you that the GET-by-ID API may not be the right way to measure the time it took for the record to become available for search.
We also used a term query earlier (search API), but even this is not the right measure of this interval, because the search API has an execution delay of several milliseconds, and under heavy load this delay will be even higher. That is, it does not give the time it took for the record to become searchable, but that time plus the search API execution time.
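One way around the search latency problem is to stop treating the first hit as the answer and instead bracket the true moment: the record cannot have been searchable before the last missing search started, and it must have been searchable by the time the first successful search returned. Below is a minimal sketch of that idea. The function name `bracket_visibility` and its parameters are hypothetical; `index_fn` and `search_fn` stand for whatever client calls you use to submit the record and to run the term query.

```python
import time


def bracket_visibility(index_fn, search_fn, clock=time.monotonic,
                       sleep=time.sleep, poll_interval=0.005, timeout=10.0):
    """Return (lower, upper) bounds in seconds on the time until a record is searchable.

    index_fn()  submits the record and returns once the request is acknowledged.
    search_fn() runs one search and returns True if the record was found.
    Times are measured from just before index_fn() is called.
    """
    t0 = clock()
    index_fn()
    lower = 0.0  # no miss observed yet, so the only known lower bound is zero
    while True:
        started = clock() - t0
        found = search_fn()
        finished = clock() - t0
        if found:
            # Searchable no later than the moment this search returned.
            return lower, finished
        # This search missed, so the record was not yet searchable when it started.
        lower = started
        if finished > timeout:
            raise TimeoutError("record did not become searchable in time")
        sleep(poll_interval)
```

The width of the returned interval is roughly one search round trip plus the poll interval, so the search execution time shows up as measurement uncertainty instead of being silently added to the result.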

Thanks for clarifying the earlier parts of my query. :smile:

Now, this brings us to the last part of it.

If I want to establish the interval after which a record submitted for indexing is available for search, how do I do it?
Is there a way to prove this availability time under fixed load on the cluster (e.g. a test case or something similar that we can simulate)?
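As a starting point for such a test case, here is a rough harness sketch in Python. `run_trials` repeats a probe many times and summarizes the measured delays; `simulated_probe` is a stand-in for a real cluster that assumes the simplest possible model: record i arrives at a fixed spacing and becomes searchable at the next periodic refresh, with no queueing, refresh cost or search latency. Both names and the parameters are made up for illustration. For a real run you would replace `simulated_probe` with a function that indexes a record on your cluster and polls a term query until it hits.

```python
import math
import statistics


def run_trials(probe, trials=100):
    """Call probe(i) for each trial and summarize the returned delays (seconds)."""
    samples = sorted(probe(i) for i in range(trials))

    def percentile(p):
        # Nearest-rank style lookup on the sorted samples.
        idx = min(len(samples) - 1, int(p / 100 * len(samples)))
        return samples[idx]

    return {
        "n": len(samples),
        "min": samples[0],
        "median": statistics.median(samples),
        "p95": percentile(95),
        "max": samples[-1],
    }


def simulated_probe(refresh_interval=1.0, arrival_spacing=0.137):
    """Toy cluster: record i arrives at i * arrival_spacing and is searchable at the next refresh tick."""
    def probe(i):
        arrival = i * arrival_spacing
        next_refresh = (math.floor(arrival / refresh_interval) + 1) * refresh_interval
        return next_refresh - arrival
    return probe


summary = run_trials(simulated_probe(refresh_interval=1.0), trials=200)
print(summary)
```

Under this toy model the delays spread between just above zero and one full refresh interval, which is the baseline to compare real measurements against. Anything a real cluster adds on top of that under a fixed indexing rate is the load-dependent part you are trying to establish.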