Point in time snapshot (PIT id), original query required when paging?

I implemented pagination in an API using point in time snapshots for efficiency. The first request generates a PIT, and subsequent requests supply that PIT.

What I would like to do is split the API endpoint into two endpoints: one to get the PIT, and another to scroll through the results using that PIT. In the Elasticsearch examples, however, the query is always sent on subsequent searches together with the PIT.

But is this really necessary? Also, what happens if you change the query in subsequent calls? Won't you get inconsistent results?

A point in time locks a version of all documents in the index, not a particular subset of documents from a query.

The Lucene index is made up of segment files which never change. New or updated documents are placed in new segments. Deletes add a note about which documents in existing segments are to be ignored. Fragmented segments are merged into newer segments (minus the deletes) and the old segments are finally deleted.

A point in time simply calls a halt to any purging of a set of old segments and remembers which segments form the required view of the data. Searches on the pit-id only search these old segments, while all other "normal" searches look at the newer segment files. When the point in time view is no longer required, merging and purging can resume on these older segments.
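The segment mechanics above can be sketched as a toy model in plain Python. This is not the real Lucene/Elasticsearch implementation, and the names (`Segment`, `Index`, `open_pit`) are made up for illustration; it only mimics the behaviour described: new writes land in new segments, deletes are tombstones, and a PIT freezes a view of the segments.

```python
class Segment:
    """An immutable bundle of documents, plus tombstones for deletes."""
    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> body; never mutated after creation
        self.deleted = set()     # doc_ids to ignore in this segment

class Index:
    def __init__(self):
        self.segments = []
        self.pits = {}           # pit_id -> frozen (segment, tombstones) pairs
        self._next_pit = 0

    def write(self, docs):
        # New or updated documents always land in a brand-new segment.
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        # A delete only marks the doc as ignored; segment files are untouched.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def open_pit(self):
        # Remember which segments (and which tombstones) form this view.
        pit_id = self._next_pit
        self._next_pit += 1
        self.pits[pit_id] = [(seg, set(seg.deleted)) for seg in self.segments]
        return pit_id

    def search(self, pit_id=None):
        # A PIT search only looks at the frozen view; a normal search
        # looks at the live segments and live tombstones.
        if pit_id is None:
            view = [(seg, seg.deleted) for seg in self.segments]
        else:
            view = self.pits[pit_id]
        out = {}
        for seg, deleted in view:
            for doc_id, body in seg.docs.items():
                if doc_id not in deleted:
                    out[doc_id] = body
        return out

idx = Index()
idx.write({1: "a", 2: "b"})
pit = idx.open_pit()
idx.write({3: "c"})   # new segment: invisible to the PIT view
idx.delete(1)         # tombstone only; the PIT's frozen view still sees doc 1
```

After these operations, a normal search sees docs 2 and 3, while a search against the PIT still sees docs 1 and 2 — which is why the query itself is not part of what the PIT pins down.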

Thanks for the explanation; it required some research on my part to fully understand.

Clearly you know what you are talking about, so I would like to ask you what to do in the following situation:

Let's say you're trying to implement pagination for a website with many active users, say 100,000. The initial search query is quite heavy, taking around 2 seconds to complete.

Clearly PIT IDs will give you consistency, but the search will still be slow.
Scroll IDs are not advised either, given the number of users.

So how would you solve this problem? Or should we just give up on server-side pagination and implement it client-side, trading bandwidth for memory?

Deep paging isn't viable when dealing with many users and an ever-changing dataset; this is the reason Google etc. don't let you page endlessly into results.
In many cases it is better to offer facets (see aggregations) or filters ("last week/month/year", etc.) as tools for users to trim the long tail, rather than offering to scroll through its full extent.
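As a concrete illustration of that advice, here is a sketch of an Elasticsearch `_search` request body (built as a Python dict) that fetches only the first page while offering a "last month" filter and a terms facet instead of deep paging. The field names (`title`, `timestamp`, `category`) are assumptions for illustration, not fields from this thread.

```python
# Hypothetical request body: one small page of hits, plus a facet the UI can
# render as clickable filters so users narrow results instead of paging deep.
body = {
    "size": 20,  # only the first page of hits is ever fetched
    "query": {
        "bool": {
            "must": [{"match": {"title": "user search terms"}}],
            # "last month" filter trims the dataset before scoring
            "filter": [{"range": {"timestamp": {"gte": "now-1M/d"}}}],
        }
    },
    "aggs": {
        # top 10 categories, shown as facets next to the results
        "by_category": {"terms": {"field": "category", "size": 10}}
    },
}
```

The idea is that each facet click or filter change issues a fresh, cheap, shallow search rather than paging deeper into one large result set.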


Thanks for the info. I see your point, but even if we limit the number of search results, what would be the best way to implement paging given a high number of users?

See the search_after parameter.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.