I was wondering how Point In Time (PIT) works internally and whether it is reliable for parallelization.
Given this implementation:
- I have a search request to retrieve a list of offers. It may return hundreds of thousands of documents.
- One 'Dispatcher' will paginate X elements and loop over the results using the
- The Dispatcher will create chunks of ids from the response and will produce RabbitMQ messages with the current PIT id and the document ids.
- RabbitMQ consumers will consume each message and call ES with their PIT id and the document ids to retrieve the data they need from the view.
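The Dispatcher steps above could be sketched roughly as follows. This is only an illustration under assumptions: `search_page` is a hypothetical callable standing in for the real Elasticsearch call, `dispatch` and `chunked` are made-up names, and the returned list stands in for publishing to RabbitMQ. The response shape (`pit_id` at the top level, `sort` on each hit) follows the usual `_search` response when a PIT and a sort are used.

```python
from typing import Callable, Iterator, Optional


def chunked(ids: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `ids` holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def dispatch(
    search_page: Callable[[str, Optional[list]], dict],
    pit_id: str,
    chunk_size: int,
) -> list:
    """Page through the results and build one message per chunk of ids.

    `search_page(pit_id, search_after)` is a hypothetical helper that runs
    one search and returns the parsed response as a dict.
    """
    messages = []
    search_after = None
    while True:
        response = search_page(pit_id, search_after)
        hits = response["hits"]["hits"]
        if not hits:
            break
        # Carry forward the most recently received PIT id.
        pit_id = response.get("pit_id", pit_id)
        ids = [hit["_id"] for hit in hits]
        for chunk in chunked(ids, chunk_size):
            # Each message is immutable: it keeps the PIT id seen at this page.
            messages.append({"pit_id": pit_id, "ids": chunk})
        # The sort values of the last hit drive the next page.
        search_after = hits[-1]["sort"]
    return messages
```

In a real Dispatcher, each message would be published to the queue instead of being collected in a list.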
When the Dispatcher loops, it can get a new PIT id from ES while the consumers may consume messages with an "outdated" PIT id.
As stated by the documentation: "The open point in time request and each subsequent search request can return different id; thus always use the most recently received id for the next search request."
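To make the quoted rule concrete, here is a minimal sketch of how the body of each paginated request might be built. The function name `build_search_body`, the `1m` keep-alive, and the `_shard_doc` tiebreaker sort are my assumptions for illustration, not something taken from the actual implementation.

```python
from typing import Optional


def build_search_body(
    pit_id: str,
    query: dict,
    page_size: int,
    search_after: Optional[list] = None,
    keep_alive: str = "1m",  # assumed value; tune to the real workload
) -> dict:
    """Build the JSON body of one PIT-based search request.

    `pit_id` should be the id returned by the previous response (or by the
    open point in time request for the first page).
    """
    body = {
        "size": page_size,
        "query": query,
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        # Assumed sort: a tiebreaker so that search_after is deterministic.
        "sort": [{"_shard_doc": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body
```

For the first page, `search_after` is omitted; for later pages it would carry the sort values of the last hit from the previous page, together with whatever PIT id that response returned.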
The question is, what happens to the previous view based on the PIT id that has been replaced?
Can my consumers still search using their PIT id, for a short time, to get the results from the view or is it unsafe / not consistent / unpredictable?
I guess not and will have a look at sliced scrolls, but if someone has more insights, they would be welcome.
If during a search request some shard used in the original PIT is not available (e.g. went offline), Elasticsearch will attempt to use a different copy of this shard if there is one available that has the same commit history. Currently even if we retry with different shards, we will still return the same PIT, but in the future we may change this.
Answering your question – yes you can trust a new PIT id or shard replacement, because we make sure to use a shard for replacement that has the same commit history.
Thanks for the quick response and the details but I am not sure it answers my problem, or I may have missed something.
If I may add an example:
- My first search returns the PIT Id AAAAA
- I send an immutable message to my queue with this PIT Id for later use (job parallelization)
- My second search with the PIT Id AAAAA returns a new PIT Id ZZZZZ, for the reason you mentioned
- 20s later, the consumer of my queue runs a search on ES with the PIT Id AAAAA
So my question is: can I still rely on the result of the query with PIT Id AAAAA despite the PIT Id having been superseded by ZZZZZ ~20 seconds ago?
Thanks for more details.
The answer to your question is yes, that is the guarantee of PIT: a certain point in time in the index, regardless of which copies of shards are being used internally. If we can't find this point in time any more, an error of "no search context found" will be returned.
But as I said in the previous answer, currently we always return the same PIT id even if we end up using different shard copies.
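On the consumer side, this could look roughly like the sketch below: build an `ids` query against the PIT id carried in the message, and detect the "no search context found" case so the job can be re-dispatched rather than silently returning nothing. The helper names are hypothetical, and the exact error `type` string (`search_context_missing_exception`) is my assumption about the error payload, so the check also falls back to matching the reason text.

```python
def build_consumer_body(pit_id: str, doc_ids: list, keep_alive: str = "1m") -> dict:
    """Body of the consumer's search: fetch the given ids from the PIT view."""
    return {
        "size": len(doc_ids),
        "query": {"ids": {"values": doc_ids}},
        "pit": {"id": pit_id, "keep_alive": keep_alive},
    }


def is_missing_search_context(error_response: dict) -> bool:
    """Return True if a parsed error response says the PIT context is gone.

    Matches either the assumed error type or the human-readable reason.
    """
    error = error_response.get("error")
    if not isinstance(error, dict):
        return False
    causes = [error] + list(error.get("root_cause", []))
    for cause in causes:
        text = f'{cause.get("type", "")} {cause.get("reason", "")}'.lower()
        if "search_context_missing" in text or "no search context found" in text:
            return True
    return False
```

A consumer that sees this condition cannot recover the old view; the only options are to fail the message or to ask the Dispatcher to restart from a fresh PIT.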
Okay, if I understand correctly, the PIT Id points to the context of a specific PIT. If it does not exist anymore, an error is returned. Until then I can still trust both snapshots to hold the same data.