Guaranteed upper bound for near real time search


(vishrut.goyal) #1

Hello,

Real-time search is not possible in Elasticsearch, but "near real time"
search is possible by setting "refresh_interval" to "1s" (1 second).
The problem is that even with the refresh interval set to 1 second, it is
not "guaranteed" that an indexed document will be available for search
after 1 second. My load tests indicate that sometimes a document is still
not available for search even 3 seconds after it was indexed. As per the
Elasticsearch documentation, "refresh_interval" controls "how often the
refresh operation will be executed". It does not provide an upper bound on
the delay between indexing a document and that document being available for
search.
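
For readers who want to reproduce the setup, here is a minimal sketch of the request body that sets "refresh_interval" through the index settings update API (PUT /<index>/_settings). The helper name refresh_interval_settings and the index name my_index are made up for illustration; the snippet only builds the JSON and does not contact a cluster.

```python
import json


def refresh_interval_settings(interval: str = "1s") -> str:
    """Build the JSON body for an index settings update that sets refresh_interval.

    The helper name is hypothetical; only the body layout follows the
    Elasticsearch index settings update API.
    """
    return json.dumps({"index": {"refresh_interval": interval}})


# Hypothetical usage: send this body with PUT /my_index/_settings
print(refresh_interval_settings("1s"))
# prints {"index": {"refresh_interval": "1s"}}
```

As the post says, this only controls how often a refresh is scheduled. It is not a bound on when a document becomes searchable.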

Is there any other setting in Elasticsearch that can guarantee such a bound?

Thanks,
Vishrut

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/347d1f45-c243-4c87-b4ae-e02eee039b13%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(David Pilato) #2

Do you compute this 3s delay between when you send the document and the search request? Or between the index response from elasticsearch and the search request?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



(Jörg Prante) #3

First, it seems you confuse "being available for search" with "near real
time". These are two different things:

  1. search after indexing is expected to take a long time because of the
    unpredictable overhead in worst case scenarios (e.g. creating index,
    creating mapping, creating document, creating segments, creating replica on
    other nodes, segment merge etc.)

  2. near real time: getting a doc ID after indexing is very fast because the
    document is immediately available in a special Lucene segment kept in RAM
    (in the millisecond range)

You can experiment by reducing the refresh interval to 50ms and exercising
the Elasticsearch (term) query operation. On RAM-only clusters you will get
the best results, but that has nothing to do with the (near) real time
feature of the Elasticsearch get operation.

Also note the complexity of distributed systems. As long as there is no
information about the workload distribution and no priority index queues
are used, no upper time bound (deadlines) can be set in distributed
indexing.

If you want a faster real time switch between index write and read, you
may be interested in

http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/ControlledRealTimeReopenThread.html

but you have to use your own custom code, because Elasticsearch does not
make use of ControlledRealTimeReopenThread.

Jörg



(vishrut.goyal) #4

The 3 second delay is between the index response and the search request. I
block on the indexing operation until I get the response, and then
schedule the search request 3 seconds after that.
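
A possible way to measure the actual delay, rather than probing once at a fixed 3 seconds, is to start a timer when the index response arrives and poll the search until the document shows up. The sketch below is only an illustration of that idea, not the load test used in this thread. index_doc and is_searchable are hypothetical callables you would implement with your own client, and the timeout and poll interval are arbitrary.

```python
import time
from typing import Callable, Optional


def measure_visibility_delay(
    index_doc: Callable[[], None],
    is_searchable: Callable[[], bool],
    timeout_s: float = 10.0,
    poll_s: float = 0.05,
) -> Optional[float]:
    """Return seconds between the index response and the first successful search.

    index_doc must block until the index response is received.
    is_searchable must run one search and report whether the document was found.
    Returns None if the document is still not found after timeout_s.
    """
    index_doc()                     # blocks until the index response arrives
    start = time.monotonic()        # the clock starts at the index response
    while True:
        if is_searchable():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(poll_s)          # resolution of the measurement
```

The measured value is only accurate to within poll_s plus the latency of one search request, so a small poll interval adds search load of its own.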

Thanks,
Vishrut



(vishrut.goyal) #5

As per the Elasticsearch documentation, the "Get" operation is completely
real time (unless explicitly disabled). We can get the document by its
doc ID immediately after indexing it.
I am talking about the Search operation here, which can be made "near
realtime" by controlling the value of "refresh_interval".

Thanks,
Vishrut



(Michael McCandless) #6

The 1s refresh_interval means that ES will open (takes some time) and warm
(takes some more time) a new NRT reader, and after that reader is done
opening, 1s later it will open again.

So it's possible in your case it takes 2s to open + warm a new NRT reader
(check the node's logs). But 2s is quite a long time for the reopen unless
the index has changed a lot (which is unlikely with 1s refresh_interval).
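
The timing described above can be sketched as a small model. It is a simplification, not how Elasticsearch schedules refreshes internally: it assumes the first reopen starts at time 0, every reopen plus warm-up takes the same reopen_time, the next reopen starts interval seconds after the previous one finishes, and a reopen only picks up documents indexed before it started. The function name visible_at is made up for illustration.

```python
def visible_at(indexed_at: float, interval: float, reopen_time: float) -> float:
    """Time at which a document indexed at indexed_at becomes searchable.

    Model assumptions (not Elasticsearch internals): reopens start at
    0, reopen_time + interval, 2 * (reopen_time + interval), ... and a
    reopen sees only documents indexed at or before its start time.
    """
    start = 0.0
    while start < indexed_at:            # this reopen began too early, it misses the doc
        start += reopen_time + interval  # next reopen begins after open/warm + interval
    return start + reopen_time           # doc is searchable once that reopen finishes


# Document indexed just as a 2s reopen finishes, with a 1s refresh_interval:
print(visible_at(2.0, interval=1.0, reopen_time=2.0) - 2.0)   # prints 3.0
```

Under these assumptions the delay is bounded by roughly interval plus twice the reopen time, so the interval alone says little once reopen and warm-up are slow.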

Mike McCandless

http://blog.mikemccandless.com



(Diego De Freitas) #7

Do you know whether it is possible to monitor this overhead beyond the one second?

Thanks

