Checking if updated documents are visible to search

Given a set of recently updated documents (indexed without waiting for refresh), is there a way to check if they have been refreshed, or to wait until they have been refreshed (without calling _refresh)?

  • Waiting for x seconds doesn’t guarantee that the refresh has completed on all shards, especially if the cluster is under high load.
  • Searching for a document and finding the up-to-date version doesn’t guarantee that a subsequent search is not routed to a replica shard that still holds a stale version.

Does the following method work?

  • Call the index API with wait_for to update a dummy document.
  • If the call returns, assume that a complete index refresh has happened on all shards (and not just the replica shards that store the dummy document!), and so all documents should be visible to search.

Is this assumption correct? Or is there a better solution?

As I wrote on Stack Overflow:

Yes. This is correct. Behind the scenes, the refresh applies to all the documents that have been added so far.

So you can indeed call the index API with an arbitrary document and the wait_for option.
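
For reference, here is a minimal sketch of such a dummy-document request, using only the JDK HTTP client rather than the Elasticsearch client. The host, the index name my-index, the document id refresh-marker and the helper dummyIndexRequest are assumptions made for illustration. The sketch only shows how the request is shaped; it says nothing about which shards the wait covers.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class WaitForDummy {
    // Builds PUT /{index}/_doc/{id}?refresh=wait_for with a tiny JSON body.
    // baseUrl, index and id are supplied by the caller (hypothetical values below).
    static HttpRequest dummyIndexRequest(String baseUrl, String index, String id) {
        URI uri = URI.create(baseUrl + "/" + index + "/_doc/" + id + "?refresh=wait_for");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"dummy\":true}"))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed setup: an unsecured local node; adjust to your cluster.
        HttpRequest request = dummyIndexRequest("http://localhost:9200", "my-index", "refresh-marker");
        // The call returns only after the dummy document is visible to search.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```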

Since you are keeping a temporary list of inserted documents on your side, you can also _search for the latest document every x seconds; once the number of hits is 1, you can assume that the _refresh has happened.
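
A rough sketch of that polling loop, kept independent of any client library: the hit count comes from a LongSupplier, which in real code would wrap a _search for the latest document. The names VisibilityPoller and waitUntilVisible are made up for this example, and the fake search in main only simulates a document that becomes visible on the third poll.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

final class VisibilityPoller {
    /**
     * Polls hitCount until it reports at least one hit or maxAttempts is used up.
     * Returns true if the document became visible, false on timeout or interruption.
     */
    static boolean waitUntilVisible(LongSupplier hitCount, int maxAttempts, Duration interval) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (hitCount.getAsLong() >= 1) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(interval.toMillis());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for a real search: pretends the document shows up on the third poll.
        int[] calls = {0};
        LongSupplier fakeSearch = () -> ++calls[0] >= 3 ? 1 : 0;
        boolean visible = waitUntilVisible(fakeSearch, 10, Duration.ofMillis(50));
        System.out.println("visible=" + visible + " after " + calls[0] + " polls");
    }
}
```

Bounding the number of attempts keeps the caller from blocking forever if the document never becomes visible. As noted in the question, a single successful search does not rule out a later search hitting a shard copy that has not refreshed yet.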

Otherwise, there's a nice BulkIngester API in Java (and in other languages as well) that you could use:

try (final BulkIngester<Void> ingester = BulkIngester.of(b -> b
        .client(client)
        .globalSettings(gs -> gs
                .index(indexName)
                .refresh(Refresh.True)
        )
        .listener(new BulkListener<>() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request, List<Void> voids) {
                logger.debug("going to execute bulk of {} requests", request.operations().size());
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, List<Void> voids, BulkResponse response) {
                logger.debug("bulk executed {} errors", response.errors() ? "with" : "without");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, List<Void> voids, Throwable failure) {
                logger.warn("error while executing bulk", failure);
            }
        })
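        .maxOperations(10)                    // flush after 10 buffered operations,
        .maxSize(10_000)                      // or once the buffered request reaches 10,000 bytes,
        .flushInterval(10, TimeUnit.SECONDS)  // or every 10 seconds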
)) {
    final var data = BinaryData.of("{\"foo\":\"bar\"}".getBytes(StandardCharsets.UTF_8), ContentType.APPLICATION_JSON);
    for (int i = 0; i < size; i++) {
        ingester.add(bo -> bo.index(io -> io.document(data)));
    }
}

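With this configuration, a bulk request is sent every 10 seconds, every 10 operations, or once the buffered request reaches 10,000 bytes, whichever comes first, and each bulk request is sent with refresh=true. The listener is notified after each bulk request, meaning that you can then empty your cache.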

PS: If you are not using bulk API, you should :wink:

My 2 cents.

Parallel discussion on Stack Overflow and here!

You can only assume a _refresh has happened on at least one shard. This is effectively the same as this observation

What about refresh=wait_for on a dummy document? Can we also only assume that the _refresh has happened on the shard(s) that hold the dummy document?

Well, the docs say this about wait_for (my emphasis):

“Wait for the changes made by the request to be made visible by a refresh before replying”

The specific request may change only a subset of shards, so only that subset of shards would be guaranteed to be refreshed. Unless I am missing something.
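
To illustrate that reasoning, here is a simplified sketch of how documents spread across shards. It is not Elasticsearch's actual routing code: the real formula hashes the routing value (the _id by default) with Murmur3, while this sketch assumes String.hashCode as a dependency-free stand-in. The class name, the document ids and the shard count are all hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ShardCoverage {
    // Simplified stand-in for routing: hash of the id modulo the number of
    // primary shards. Elasticsearch itself uses Murmur3, not String.hashCode.
    static int shardFor(String id, int numberOfShards) {
        return Math.floorMod(id.hashCode(), numberOfShards);
    }

    // Shards that hold at least one of the given documents.
    static Set<Integer> shardsTouched(List<String> ids, int numberOfShards) {
        Set<Integer> shards = new TreeSet<>();
        for (String id : ids) {
            shards.add(shardFor(id, numberOfShards));
        }
        return shards;
    }

    public static void main(String[] args) {
        int numberOfShards = 3;
        List<String> updatedIds = List.of("a", "b", "c"); // hypothetical updated documents
        Set<Integer> pending = shardsTouched(updatedIds, numberOfShards);
        int dummyShard = shardFor("dummy", numberOfShards);
        // Only the dummy document's shard is covered by its wait_for.
        pending.remove(dummyShard);
        System.out.println("dummy document lands on shard " + dummyShard);
        System.out.println("shards with updates not covered by its wait_for: " + pending);
    }
}
```

Whenever the updated documents touch more shards than the dummy document does, some shards are left outside the guarantee quoted above.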

I don't think it would be too hard to validate this with a script.

Yes, you are right. I tested it, and wait_for indeed does not guarantee a complete refresh on all shards. :frowning:

It is to be expected that it behaves consistently with its own documentation.

But my favorite proverb is “Trust, but verify!”