I have an index, and as a post-indexing step we re-index it. Afterwards, the pipeline (written in Java) runs an aggregation query to fetch product attributes (for example, categories and their counts) and caches these values. I later noticed that the cached values do not match the values in the index.
I believe the problem is that we fetch the categories immediately after re-indexing, when the index may not yet be ready to read (meaning the re-indexing may not have completely finished). Some of the answers I found suggest setting a refresh policy in Java:
IndexRequest indexRequest = new IndexRequest(indexAlias);
indexRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL);
but my operation is not actually an IndexRequest; it is a SearchRequest, since we are sending an aggregation query to the index.
How can I know that my index is fully ready, so that a read returns its most up-to-date data?
Refreshes are resource-intensive. To ensure good cluster performance, we recommend waiting for Elasticsearch’s periodic refresh rather than performing an explicit refresh when possible.
If your application workflow indexes documents and then runs a search to retrieve the indexed document, we recommend using the index API's refresh=wait_for query parameter option. This option ensures the indexing operation waits for a periodic refresh before running the search.
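To make the quoted advice concrete, here is a minimal sketch of an index request that carries refresh=wait_for, built with only the JDK's java.net.http classes instead of the Elasticsearch client. The class name, the base URL, the index name "products", and the document id are assumptions for illustration; only the `/_doc/<id>?refresh=wait_for` form comes from the index API.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of any Elasticsearch client library.
class WaitForRefreshSketch {

    // Builds <baseUrl>/<index>/_doc/<id>?refresh=wait_for.
    // The query parameter asks Elasticsearch to hold the response
    // until a refresh has made this write visible to search.
    static URI indexUri(String baseUrl, String index, String id) {
        return URI.create(baseUrl + "/" + index + "/_doc/" + id + "?refresh=wait_for");
    }

    // Builds (but does not send) the PUT request for one JSON document.
    static HttpRequest indexRequest(String baseUrl, String index, String id, String json) {
        return HttpRequest.newBuilder()
                .uri(indexUri(baseUrl, index, id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }
}
```

The parameter belongs to the write request, so in a pipeline like the one described it would go on the last write of the re-indexing step, not on the SearchRequest that runs the aggregation.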
My re-indexing pipeline is a scheduled job that runs once a day. So, for each index, I would refresh only once per day, just before the aggregation query that caches the document categories. Would refreshing this way cause any problems for my users' searches afterwards?
Btw, I couldn't find a refresh=wait_for option in Python. Do you know whether I can use it with .indices().refresh(index=index)?
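For reference, refresh=wait_for is a parameter of write requests (index, update, delete, bulk), while an explicit refresh is a separate endpoint, POST /<index>/_refresh, which takes no such option. A rough sketch of that once-a-day explicit refresh, again using only the JDK HTTP client rather than the Elasticsearch client; the class name, base URL, and index name are assumptions for illustration:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of any Elasticsearch client library.
class ExplicitRefreshSketch {

    // Builds <baseUrl>/<index>/_refresh, the refresh API endpoint.
    static URI refreshUri(String baseUrl, String index) {
        return URI.create(baseUrl + "/" + index + "/_refresh");
    }

    // Builds (but does not send) the POST request; the refresh API needs no body.
    static HttpRequest refreshRequest(String baseUrl, String index) {
        return HttpRequest.newBuilder()
                .uri(refreshUri(baseUrl, index))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // Sends the refresh and reports whether Elasticsearch answered with HTTP 200.
    // Intended to be called once, after re-indexing and before the aggregation query.
    static boolean refresh(HttpClient client, String baseUrl, String index)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(refreshRequest(baseUrl, index), HttpResponse.BodyHandlers.ofString());
        return response.statusCode() == 200;
    }
}
```

In the daily job this would be something like `ExplicitRefreshSketch.refresh(HttpClient.newHttpClient(), "http://localhost:9200", "products")` right before the aggregation runs. As far as I understand, the call returns only after the shards have been refreshed, so a search issued afterwards should see the re-indexed documents.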