Given a set of recently updated documents (indexed without waiting for refresh), is there a way to check if they have been refreshed, or to wait until they have been refreshed (without calling _refresh)?
Waiting for x seconds doesn’t guarantee that the refresh has completed on all shards, especially if the cluster is under high load.
Searching for a document and finding the up-to-date version doesn’t guarantee that a subsequent search is not routed to a replica shard that still holds a stale version.
Does the following method work?
Call the index API with refresh=wait_for to update a dummy document.
If the call returns, assume that a complete index refresh has happened on all shards (and not just the shards that store the dummy document!), and so all documents should be visible to search.
Is this assumption correct? Or is there a better solution?
Yes, this is correct. Behind the scenes, the refresh will cover all the documents that have been added so far.
So you can indeed call the index API on an arbitrary document with the refresh=wait_for option.
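For instance, the dummy-document call could look like this sketch against the Java client (the index name, document id, and the `client` instance are placeholder assumptions, and it needs a running cluster):

```java
// Sketch only: assumes an existing ElasticsearchClient `client`.
// The call blocks until a refresh has made the dummy document
// visible to search.
client.index(i -> i
    .index("my-index")
    .id("refresh-marker")
    .document(Map.of("marker", true))
    .refresh(Refresh.WaitFor)
);
```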
Since you are holding a temporary list of inserted documents on your side, you can also _search for the latest document every x seconds and, once the number of hits is 1, assume that the refresh has happened.
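The polling approach can be sketched generically like this (the `searchHits` supplier is a stand-in for your real _search call and its hit count; nothing here is Elasticsearch-specific):

```java
import java.util.function.Supplier;

// Generic sketch of the "search every x seconds" approach. `searchHits`
// stands in for a function that runs the real _search and returns the
// number of hits for the latest inserted document.
class RefreshPoller {
    static boolean waitUntilVisible(Supplier<Long> searchHits, long expectedHits,
                                    long intervalMillis, int maxAttempts) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (searchHits.get() >= expectedHits) {
                return true;  // a refresh has made the document(s) searchable
            }
            Thread.sleep(intervalMillis);
        }
        return false;  // gave up: the refresh may still be pending
    }
}
```

Note that a `false` return only means the timeout was hit, so callers should decide whether to retry or fall back to an explicit _refresh.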
Otherwise, there's a nice BulkIngester API in the Java client, and in other language clients as well, that you could use:
try (final BulkIngester<Void> ingester = BulkIngester.of(b -> b
        .client(client)
        .globalSettings(gs -> gs
            .index(indexName)
            .refresh(Refresh.True)
        )
        .listener(new BulkListener<>() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request, List<Void> voids) {
                logger.debug("going to execute bulk of {} requests", request.operations().size());
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, List<Void> voids, BulkResponse response) {
                logger.debug("bulk executed {} errors", response.errors() ? "with" : "without");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, List<Void> voids, Throwable failure) {
                logger.warn("error while executing bulk", failure);
            }
        })
        .maxOperations(10)
        .maxSize(10_000)
        .flushInterval(10, TimeUnit.SECONDS)
)) {
    final var data = BinaryData.of("{\"foo\":\"bar\"}".getBytes(StandardCharsets.UTF_8), ContentType.APPLICATION_JSON);
    for (int i = 0; i < size; i++) {
        ingester.add(bo -> bo.index(io -> io.document(data)));
    }
}
With that configuration, a bulk request will be sent, with refresh enabled, every 10 seconds, every 10 operations, or every 10,000 bytes of payload, whichever comes first. The listener is notified when each bulk completes, meaning that you can then empty your cache.
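The cache-emptying idea could look like this hypothetical helper (the class and method names are mine, not part of the client): mark ids as pending when you hand documents to the ingester, and clear them from `afterBulk` once the bulk has completed with refresh enabled.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache of document ids that have been sent but not yet
// confirmed searchable. Ids are removed once the bulk they belong to
// has executed (with refresh enabled), i.e. from BulkListener.afterBulk.
class PendingDocs {
    private final Set<String> pending = ConcurrentHashMap.newKeySet();

    void markPending(String id) {
        pending.add(id);
    }

    // Call this from afterBulk with the ids of the flushed batch.
    void markVisible(List<String> flushedIds) {
        flushedIds.forEach(pending::remove);
    }

    boolean isVisible(String id) {
        return !pending.contains(id);
    }
}
```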
“Wait for the changes made by the request to be made visible by a refresh before replying”
The specific request may change only a subset of shards, so only that subset of shards would be guaranteed to be refreshed, unless I am missing something.
I don't think it would be too hard to validate this via a script.