Can I trust a PIT id once it has been replaced by a new PIT id?

Hello there,

I was wondering how Point In Time (PIT) works internally and whether it can be trusted for parallelization.

Context

Given this implementation:

  • I have a search request to retrieve a list of offers. It may return hundreds of thousands of documents.
  • One 'Dispatcher' will paginate X elements and loop over the results using the PIT and search_after parameters.
  • The Dispatcher will create chunks of ids from the response and will produce RabbitMQ messages with the current PIT id and the document ids.
  • RabbitMQ consumers will consume each message and call ES with their PIT id and the document ids to retrieve the data they need from the view.
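A minimal sketch of the Dispatcher loop described above, to make the flow concrete. It assumes a hypothetical `search` callable that sends a request body to the Elasticsearch search endpoint and returns the parsed JSON response; the function name `dispatch_chunks`, the page size, the `keep_alive` value and the `_shard_doc` sort are illustrative choices, not part of my actual code.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical transport: takes a search request body, returns the parsed response.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def dispatch_chunks(
    search: SearchFn,
    pit_id: str,
    query: Dict[str, Any],
    page_size: int = 1000,
    keep_alive: str = "5m",
) -> Iterator[Tuple[str, List[str]]]:
    """Yield one (pit_id, document_ids) pair per page of results."""
    search_after: Optional[List[Any]] = None
    while True:
        body: Dict[str, Any] = {
            "size": page_size,
            "query": query,
            "pit": {"id": pit_id, "keep_alive": keep_alive},
            "sort": [{"_shard_doc": "asc"}],
        }
        if search_after is not None:
            body["search_after"] = search_after
        response = search(body)
        # Carry the most recently returned id into the next request.
        pit_id = response.get("pit_id", pit_id)
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield pit_id, [hit["_id"] for hit in hits]
        # The sort values of the last hit are the cursor for the next page.
        search_after = hits[-1]["sort"]
```

Each yielded pair would become one RabbitMQ message, so a message carries whatever PIT id was current when its page was read.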

While the Dispatcher loops, it can get a new PIT id from ES while the consumers are still consuming messages with an "outdated" PIT id.

As stated by the documentation: "The open point in time request and each subsequent search request can return different id; thus always use the most recently received id for the next search request."

The question is, what happens to the previous view based on the PIT id that has been replaced?

Can my consumers still search using their PIT id, for a short time, to get the results from the view or is it unsafe / not consistent / unpredictable?

I guess not, and I will have a look at sliced scrolls, but if someone has more insights, they would be welcome.

Thank you

If, during a search request, some shard used in the original PIT is not available (e.g. it went offline), Elasticsearch will attempt to use a different copy of this shard, provided one is available and has the same commit history. Currently, even if we retry with different shards, we still return the same PIT id, but in the future we may change this.

To answer your question: yes, you can trust a new PIT id or a shard replacement, because we make sure the replacement shard has the same commit history.

Hey Mayya,

Thanks for the quick response and the details, but I am not sure it answers my question, or I may have missed something.

If I may add an example:

  • My first search returns the PIT Id AAAAA
  • I send an immutable message to my queue with this PIT Id for later use (job parallelization)
  • My second search with the PIT Id AAAAA returns a new PIT Id ZZZZZ, for the reason you mentioned
  • 20s later, the consumer of my queue runs a search on ES with the PIT Id AAAAA

So my question is: can I still rely on the result of the query with PIT Id AAAAA despite the PIT Id having been superseded by ZZZZZ ~20 seconds ago?

Thanks for more details.

The answer to your question is yes. That is the guarantee of a PIT: it refers to a certain point in time in the index, regardless of which shard copies are used internally. If we can't find this point in time any more, a "no search context found" error is returned.

But as I said in the previous answer, currently we always return the same PIT id even if we end up using different shard copies.
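To illustrate the error case on the consumer side, here is a hedged sketch of a lookup by document ids against a PIT. It again assumes a hypothetical `search` callable that raises an exception whose message contains the server's "no search context found" text; `PitExpiredError` and `fetch_by_ids` are illustrative names, and matching on the message text is an assumption, since the exact exception type depends on the client library.

```python
from typing import Any, Callable, Dict, List

# Hypothetical transport: takes a search request body, returns the parsed response.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


class PitExpiredError(Exception):
    """The search context behind the PIT id no longer exists (illustrative name)."""


def fetch_by_ids(
    search: SearchFn,
    pit_id: str,
    doc_ids: List[str],
    keep_alive: str = "1m",
) -> List[Dict[str, Any]]:
    """Fetch the given documents from the view identified by pit_id."""
    body: Dict[str, Any] = {
        "size": len(doc_ids),
        "query": {"ids": {"values": doc_ids}},
        "pit": {"id": pit_id, "keep_alive": keep_alive},
    }
    try:
        response = search(body)
    except Exception as exc:
        # Assumption: the transport surfaces the server's reason in the message.
        if "no search context found" in str(exc).lower():
            raise PitExpiredError(pit_id) from exc
        raise
    return response["hits"]["hits"]
```

A consumer that catches `PitExpiredError` knows the view is gone and can reject or requeue the message instead of silently reading newer data.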

Okay, if I understand correctly, the PIT Id points to the context of a specific PIT. If that context does not exist anymore, an error is returned. Until then, I can still trust both ids to refer to the same data.

Thanks.
