How to ensure unique sorting with search_after in Elasticsearch 8?

Hello,

I am working on a tool that needs to retrieve large batches of records from an Elasticsearch index. The recommended method used to be the Scroll API, but the documentation no longer recommends it for deep pagination and points to search_after instead.

I want to use sorting criteria that are agnostic to the document content. Sorting by @timestamp and _id seems appropriate, but sorting on _id is now disabled by default in Elasticsearch 8.

If I sort only by @timestamp, the sort value is not unique, so records that share a timestamp across a page boundary could be skipped.

So is there a way to efficiently retrieve large volumes of data (more than 10,000 documents) while ensuring no records are skipped, using sorting criteria independent of document content?

Can you use the point in time (PIT) API with search_after? Searches that run against a PIT automatically get a unique tiebreaker added to the sort, the _shard_doc field.
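To make the PIT suggestion concrete, here is a minimal sketch in Python of a search_after loop that sorts on @timestamp with an explicit _shard_doc tiebreaker. It assumes a PIT has already been opened (for example with POST /my-index/_pit?keep_alive=1m, where my-index is a placeholder). The names build_search_body, iter_hits and the search callable are hypothetical: search stands for whatever function sends the body to the _search endpoint and returns the parsed JSON response.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def build_search_body(
    pit_id: str,
    search_after: Optional[List[Any]] = None,
    size: int = 10_000,
    keep_alive: str = "1m",
) -> Dict[str, Any]:
    """Build one page request for a PIT search sorted by @timestamp, then _shard_doc."""
    body: Dict[str, Any] = {
        "size": size,
        # No index is given in the request path: the PIT id identifies the target.
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        # _shard_doc is unique per document within a PIT, so it breaks timestamp ties.
        "sort": [{"@timestamp": "asc"}, {"_shard_doc": "asc"}],
        "track_total_hits": False,
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


def iter_hits(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],  # hypothetical transport function
    pit_id: str,
    size: int = 10_000,
) -> Iterator[Dict[str, Any]]:
    """Yield every hit, one page at a time, until an empty page comes back."""
    search_after: Optional[List[Any]] = None
    while True:
        resp = search(build_search_body(pit_id, search_after, size))
        hits = resp["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The sort values of the last hit are the cursor for the next page.
        search_after = hits[-1]["sort"]
        # The response may carry an updated PIT id; use the most recent one.
        pit_id = resp.get("pit_id", pit_id)
```

Because the cursor includes the _shard_doc value, documents that share a timestamp at a page boundary are neither skipped nor returned twice. The PIT should be closed once the loop finishes.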

If not, and you don't have a unique value to sort on, then the flow that I use is:

  1. Get a search_after batch of 10k documents
  2. Grab the timestamp from the last document (aka max_timestamp)
  3. Iterate through the batch while timestamp < max_timestamp, doing whatever processing is required
  4. On the last processed document, grab the sort values for your next search_after call

This effectively means you exclude the documents with max_timestamp from processing in the current batch, which ensures that all documents with max_timestamp are present in your next search_after call.
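The flow above can be sketched roughly as follows in Python, assuming an ascending sort on @timestamp only, so that each hit's "sort" array holds the timestamp as its first value. The name split_batch is hypothetical, and two details are my own additions rather than part of the described flow: a page shorter than the page size is treated as the final page and processed in full, and a full page where every document shares one timestamp raises an error because the cursor could not advance.

```python
from typing import Any, Dict, List, Optional, Tuple

Hit = Dict[str, Any]


def split_batch(
    hits: List[Hit], page_size: int
) -> Tuple[List[Hit], Optional[List[Any]]]:
    """Return (hits to process now, sort values for the next search_after call).

    The second element is None when there is nothing more to fetch.
    """
    if not hits:
        return [], None
    if len(hits) < page_size:
        # Assumption: a short page is the last one, so nothing needs holding back.
        return hits, None

    # Step 2: the last document carries the highest timestamp in the batch.
    max_timestamp = hits[-1]["sort"][0]

    # Step 3: only process documents strictly older than max_timestamp.
    to_process = [h for h in hits if h["sort"][0] < max_timestamp]
    if not to_process:
        # Every document shares one timestamp: the cursor would never move.
        raise ValueError("page_size is too small for this many identical timestamps")

    # Step 4: the last processed document provides the next cursor, so the next
    # page starts with every document that has max_timestamp.
    return to_process, to_process[-1]["sort"]
```

For a full batch with timestamps 100, 100, 101, 102, 102, the first three documents are processed and the next call uses search_after [101], which returns both documents with timestamp 102 again.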

Thanks William. I tried sorting on tie_breaker_id, as in the example from the documentation, but the query then returned no results (in theory it should be available, since my instance runs version 8.17).

Anyway, I got a working solution by sorting on the _shard_doc field.