Point in time snapshot (PIT id), original query required when paging?

I implemented pagination in an API using point in time snapshots for efficiency. The first request generates a PIT, and subsequent requests supply that PIT.

What I would like to do is split the API endpoint into two endpoints: one to get the PIT, and another to scroll through the results using that PIT. In the Elasticsearch examples, however, the query is always sent on subsequent searches together with the PIT.

But is this really necessary? Also, what happens if you change the query in subsequent calls? Won't you get inconsistent results?

A point in time locks a version of all documents in the index, not a particular subset of documents from a query.

The Lucene index is made up of segment files which never change. New or updated documents are placed in new segments. Deletes add a note about which documents in existing segments are to be ignored. Fragmented segments are merged into newer segments (minus the deletes) and the old segments are finally deleted.

A point in time simply calls a halt to any purging of a set of old segments and remembers which segments form the required view of the data. Searches on the pit-id only search these old segments, while all other "normal" searches look at the newer segment files. When the point in time view is no longer required, merging and purging can resume on these older segments.
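The segment mechanics above can be sketched as a toy model in plain Python. This is not the real Lucene/Elasticsearch implementation, and the names (`Segment`, `Index`, `open_pit`) are made up for illustration; it only mimics the behaviour described: new writes land in new segments, deletes are tombstones, and a PIT freezes a view of the segments.

```python
class Segment:
    """An immutable bundle of documents, plus tombstones for deletes."""
    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> body; never mutated after creation
        self.deleted = set()     # doc_ids to ignore in this segment

class Index:
    def __init__(self):
        self.segments = []
        self.pits = {}           # pit_id -> frozen (segment, tombstones) pairs
        self._next_pit = 0

    def write(self, docs):
        # New or updated documents always land in a brand-new segment.
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        # A delete only marks the doc as ignored; segment files are untouched.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def open_pit(self):
        # Remember which segments (and which tombstones) form this view.
        pit_id = self._next_pit
        self._next_pit += 1
        self.pits[pit_id] = [(seg, set(seg.deleted)) for seg in self.segments]
        return pit_id

    def search(self, pit_id=None):
        # A PIT search only looks at the frozen view; a normal search
        # looks at the live segments and live tombstones.
        if pit_id is None:
            view = [(seg, seg.deleted) for seg in self.segments]
        else:
            view = self.pits[pit_id]
        out = {}
        for seg, deleted in view:
            for doc_id, body in seg.docs.items():
                if doc_id not in deleted:
                    out[doc_id] = body
        return out

idx = Index()
idx.write({1: "a", 2: "b"})
pit = idx.open_pit()
idx.write({3: "c"})   # new segment: invisible to the PIT view
idx.delete(1)         # tombstone only; the PIT's frozen view still sees doc 1
```

After these operations, a normal search sees docs 2 and 3, while a search against the PIT still sees docs 1 and 2 — which is why the query itself is not part of what the PIT pins down.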

Thanks for the explanation; it required some research on my part to fully understand.

Clearly you know what you are talking about, so I would like to ask you what to do in the following situation:

Let's say you're trying to implement pagination for a website with many active users, say 100,000. The initial search query is quite heavy, taking around 2 seconds to complete.

Clearly PIT IDs will give you consistency, but the search will still be slow.
Scroll IDs are not advised either, given the number of users.

So how would you solve this problem? Or should we just give up on server-side pagination and implement it client-side, trading bandwidth for memory?

Deep paging isn't viable when dealing with many users and an ever-changing dataset; this is the reason Google etc. don't let you page endlessly into results.
In many cases it is better to offer facets (see aggregations) or filters ("last week/month/year", etc.) as tools for users to trim the long tail, rather than offering to scroll through its full extent.
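As a concrete illustration of that advice, here is a sketch of an Elasticsearch `_search` request body (built as a Python dict) that fetches only the first page while offering a "last month" filter and a terms facet instead of deep paging. The field names (`title`, `timestamp`, `category`) are assumptions for illustration, not fields from this thread.

```python
# Hypothetical request body: one small page of hits, plus a facet the UI can
# render as clickable filters so users narrow results instead of paging deep.
body = {
    "size": 20,  # only the first page of hits is ever fetched
    "query": {
        "bool": {
            "must": [{"match": {"title": "user search terms"}}],
            # "last month" filter trims the dataset before scoring
            "filter": [{"range": {"timestamp": {"gte": "now-1M/d"}}}],
        }
    },
    "aggs": {
        # top 10 categories, shown as facets next to the results
        "by_category": {"terms": {"field": "category", "size": 10}}
    },
}
```

The idea is that each facet click or filter change issues a fresh, cheap, shallow search rather than paging deeper into one large result set.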


Thanks for the info. I see your point, but even if we limit the number of search results, what would be the best way to implement paging given a high number of users?

See the search_after parameter.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.