I was wondering how Point In Time (PIT) works internally and whether it can be trusted for parallelization.
Context
Given this implementation:
I have a search request that retrieves a list of offers. It may return hundreds of thousands of documents.
One 'Dispatcher' will paginate X elements at a time and loop over the results using the PIT and search_after parameters.
The Dispatcher will create chunks of ids from each response and produce RabbitMQ messages containing the current PIT id and the document ids.
RabbitMQ consumers will consume each message and call ES with their PIT id and the document ids to retrieve the data they need from the view.
When the Dispatcher loops, it can get a new PIT id from ES while the consumers may still be consuming messages with an "outdated" PIT id.
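To make the Dispatcher loop above concrete, here is a minimal Python sketch of it. It is only an illustration under assumptions: `dispatch_chunks` and the `search` callable are hypothetical names (the callable stands in for whatever wraps your ES client and returns the parsed search response), and the `_shard_doc` sort is my choice of tiebreaker, not something from the setup above. Publishing to RabbitMQ is left out; each yielded pair is what would go into one message.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical type: a function that sends a search body to ES and
# returns the parsed JSON response as a dict.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def dispatch_chunks(
    search: SearchFn,
    pit_id: str,
    query: Dict[str, Any],
    page_size: int = 1000,
    keep_alive: str = "1m",
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (pit_id, document ids) for each page of the PIT view."""
    search_after: Optional[List[Any]] = None
    while True:
        body: Dict[str, Any] = {
            "size": page_size,
            "query": query,
            "pit": {"id": pit_id, "keep_alive": keep_alive},
            # Assumption: sort on the _shard_doc tiebreaker only.
            "sort": [{"_shard_doc": "asc"}],
        }
        if search_after is not None:
            body["search_after"] = search_after

        response = search(body)
        # Always carry the most recently received PIT id forward.
        pit_id = response.get("pit_id", pit_id)

        hits = response["hits"]["hits"]
        if not hits:
            return
        yield pit_id, [hit["_id"] for hit in hits]
        # The sort values of the last hit are the cursor for the next page.
        search_after = hits[-1]["sort"]
```

Each message then carries the PIT id that was current when its chunk was read, which is exactly the situation the question below is about.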
As stated by the documentation: "The open point in time request and each subsequent search request can return different id; thus always use the most recently received id for the next search request."
The question is, what happens to the previous view based on the PIT id that has been replaced?
Can my consumers still search using their PIT id, for a short time, to get the results from the view, or is it unsafe / inconsistent / unpredictable?
I guess not, and I will have a look at sliced scrolls, but if someone has more insights, they would be welcome.
If, during a search request, some shard used in the original PIT is not available (e.g. it went offline), Elasticsearch will attempt to use a different copy of this shard, provided one is available that has the same commit history. Currently, even if we retry with different shards, we will still return the same PIT id, but in the future we may change this.
To answer your question: yes, you can trust a new PIT id or a shard replacement, because we make sure that the replacement shard copy has the same commit history.
The answer to your question is yes; that is the guarantee of PIT: a certain point in time in the index, regardless of which shard copies are being used internally. If we can't find this point in time any more, a "no search context found" error will be returned.
But as I said in the previous answer, currently we always return the same PIT id even if we end up using different shard copies.
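On the consumer side, that behaviour could be handled roughly as in the sketch below. It is a sketch under assumptions, not the way any client library does it: `fetch_from_view` and `SearchContextMissing` are hypothetical names, and the `search` callable is assumed to raise `SearchContextMissing` when ES reports that the search context no longer exists. How that error actually surfaces depends on your client.

```python
from typing import Any, Callable, Dict, List, Optional


class SearchContextMissing(Exception):
    """Hypothetical error raised by the search wrapper when ES answers
    that no search context was found for the PIT id."""


def fetch_from_view(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    pit_id: str,
    doc_ids: List[str],
    keep_alive: str = "1m",
) -> Optional[Dict[str, Any]]:
    """Return {_id: _source} read from the PIT view, or None if the view is gone."""
    body = {
        # Assumes a chunk stays within the index's result window
        # (10,000 hits by default).
        "size": len(doc_ids),
        "query": {"ids": {"values": doc_ids}},
        "pit": {"id": pit_id, "keep_alive": keep_alive},
    }
    try:
        response = search(body)
    except SearchContextMissing:
        # The point in time cannot be found any more; the caller decides
        # whether to drop, requeue or re-dispatch the message.
        return None
    return {hit["_id"]: hit["_source"] for hit in response["hits"]["hits"]}
```

Returning `None` rather than an empty dict keeps "the view is gone" distinct from "none of these ids matched", so a consumer never mistakes an expired PIT for an empty result.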
Okay, if I understand correctly, the PIT id points to the context of a specific PIT. If it does not exist anymore, an error is returned. Until then, I can still trust both snapshots to hold the same data.