Update by query and refresh

Hello,

We use the _update_by_query plugin to bulk update documents. In our tests
we've hit an issue where not all the documents are updated, because the
index may not have been refreshed before we run _update_by_query.

We have the refresh interval set to 1 sec, so this issue won't happen very
often in real life, as there is usually a longer gap between adding and
updating a document.

Nevertheless we want to solve the issue. Right now I can see two solutions:

  1. Migrate _update_by_query to update by _id where possible (this works
    because documents are *gettable* by id right after they are indexed)
  2. Issue a refresh before all _update_by_query operations

The latter solution will make us safe (_refresh is blocking, so we'll wait
for confirmation before issuing the update by query), but what is the
performance cost? Is it a major one? For 99% of the _update_by_query calls
the refresh is not needed, but we have no way to tell upfront.
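
To make the second option concrete, here is a minimal sketch in Python. It is an illustration only: `send` is a hypothetical transport callable (method, path, body) supplied by the caller, not part of any client library, and the `/<index>/_update_by_query` path is assumed to be the endpoint the plugin exposes.

```python
from typing import Any, Callable, Dict

# Hypothetical transport: send(method, path, body) -> parsed JSON response.
# Plug in whatever HTTP client you already use.
Send = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def refresh_then_update_by_query(
    send: Send, index: str, body: Dict[str, Any]
) -> Dict[str, Any]:
    """Refresh the index, wait for the response, then run the update by query."""
    # The refresh call is blocking: send() returns only once the response
    # has arrived, so recently indexed documents are searchable afterwards.
    send("POST", f"/{index}/_refresh", {})
    # Assumed plugin endpoint; the body (query plus script) is passed through
    # unchanged.
    return send("POST", f"/{index}/_update_by_query", body)
```

The two requests are issued strictly one after the other, so the cost of this approach is one extra refresh per update by query.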

Thanks,
Igor

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/57e7df7c-b6ec-4af5-bc83-37880df974c9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hi Igor,

It really depends on your indexing rate. If you plan on performing no more
than one refresh per second, things will be fine (this is what
Elasticsearch does by default). However, running refresh much more often
could cause a lot more flush/merge activity, and this will hurt not only
your indexing rate but also your search rate, because of all the new
segments that keep being published. I don't really have a solution to this
issue; it is a hard problem.

On Mon, Oct 20, 2014 at 11:58 AM, Igor Kupczyński puszczyk@gmail.com
wrote:

--
Adrien Grand

Hi Adrien, thanks for the answer. I'll start by issuing a refresh where
needed. If the refresh rate ends up killing performance, I'll simply
throttle it.
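
One possible way to throttle, as a rough Python sketch: cap refreshes at one per `min_interval` seconds by waiting out the remainder of the interval before refreshing again. The class name and the `do_refresh`, `clock` and `sleep` parameters are made up for this example; `do_refresh` stands for whatever call issues the blocking _refresh.

```python
import time
from typing import Callable, Optional


class ThrottledRefresher:
    """Issues refreshes no more often than once per `min_interval` seconds."""

    def __init__(
        self,
        do_refresh: Callable[[], None],
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._do_refresh = do_refresh
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the last refresh finished

    def refresh(self) -> float:
        """Refresh, waiting first if the previous one was too recent.

        Returns the number of seconds spent waiting.
        """
        wait = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            wait = max(0.0, self._min_interval - elapsed)
        if wait > 0:
            self._sleep(wait)
        self._do_refresh()
        self._last = self._clock()
        return wait
```

Waiting instead of skipping keeps the guarantee that a refresh always happens before the update by query, at the price of some added latency when calls arrive in bursts. The sketch is not thread-safe; concurrent callers would need a lock around `refresh()`.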

Thanks,
Igor
