We use the _update_by_query plugin to bulk-update documents. In our tests
we've hit an issue where not all documents are updated, because the
index may not have been refreshed before we run _update_by_query.
We have the refresh interval set to 1 second, so this issue won't happen
very often in real life, as there is usually a longer gap between
adding and updating a document.
Nevertheless, we want to solve the issue. Right now I can see two solutions:
1. Migrate _update_by_query to update by _id where possible (this works, as
   documents are *gettable* by id right after they are indexed).
2. Issue a refresh before all _update_by_query operations.
The latter solution will make us safe (_refresh is blocking, so we'll wait
for confirmation before issuing the update-by-query calls), but what is the
performance cost? Is it a major one? For 99% of the _update_by_query calls
the refresh is not needed, but we have no way to tell up front.
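To make the two options concrete, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not the plugin's documented client: the helper names, the `send` parameter and the `base_url`/`index` values are hypothetical, the update-by-id URL follows the Elasticsearch 1.x `/{index}/{type}/{id}/_update` layout, and the update-by-query body is passed through unchanged because its exact shape depends on the plugin version.

```python
import json
import urllib.request


def _post(url, body=None):
    """Build (but do not send) a POST request with an optional JSON body."""
    data = b"" if body is None else json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def update_by_id(base_url, index, doc_type, doc_id, partial_doc,
                 send=urllib.request.urlopen):
    """Option 1: partial update addressed by _id (no refresh required)."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}/_update"
    return send(_post(url, {"doc": partial_doc}))


def refresh_then_update_by_query(base_url, index, body,
                                 send=urllib.request.urlopen):
    """Option 2: blocking refresh first, then the update-by-query call.

    `body` is whatever request body the plugin expects (assumption:
    the caller already knows the right shape for their plugin version).
    """
    # The refresh call returns only once the refresh has completed.
    send(_post(f"{base_url}/{index}/_refresh"))
    return send(_post(f"{base_url}/{index}/_update_by_query", body))
```

Passing `send` explicitly keeps the sketch testable without a running cluster; in real code it would be whatever HTTP client is already in use.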
It really depends on your indexing rate. If you plan on performing no more
than one refresh per second, things will be fine (this is what
Elasticsearch does by default). However, running refresh much more often
could cause a lot more flush/merge activity, and this will hurt not only
your indexing rate but also your search rate, because of all the new
segments that keep being published. I don't really have a solution to this
issue; it is a hard problem.
On Mon, Oct 20, 2014 at 11:58 AM, Igor Kupczyński puszczyk@gmail.com
wrote:
Hi Adrien, thanks for the answer. I'll start by issuing a refresh where
needed. If the refresh rate ends up killing performance, I'll simply
throttle it.
Thanks,
Igor
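One possible shape for the throttling mentioned above, as a hedged Python sketch: refresh only when something was indexed since the last refresh, and never more often than once per `min_interval` seconds (1 second here, matching the default rate discussed in the thread). The class name, `mark_dirty`, and the injectable `clock`/`sleep` parameters are hypothetical and not part of any Elasticsearch client.

```python
import time


class ThrottledRefresher:
    """Issue refreshes at most once per `min_interval` seconds."""

    def __init__(self, refresh_fn, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._refresh_fn = refresh_fn      # callable that performs the refresh
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_refresh = None          # time of the last refresh we issued
        self._dirty = False                # True if documents were indexed since

    def mark_dirty(self):
        """Call after indexing a document that a later query must see."""
        self._dirty = True

    def ensure_refreshed(self):
        """Refresh if needed; returns True when a refresh was issued."""
        if not self._dirty:
            return False
        if self._last_refresh is not None:
            remaining = self._min_interval - (self._clock() - self._last_refresh)
            if remaining > 0:
                # Wait instead of skipping, so the query still sees new docs.
                self._sleep(remaining)
        self._refresh_fn()
        self._last_refresh = self._clock()
        self._dirty = False
        return True
```

Calling `ensure_refreshed()` right before each update-by-query keeps the safety of the explicit refresh, while back-to-back calls wait out the interval rather than producing a burst of new segments.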