We are trying to use ES as the data store for our UI. One of the major
requirements for this is sorting and pagination of sorted data. Is there any
built-in support in ES for pagination? I tried scroll, but it seems it does
not work for sorted data. For example, I have 10 million records, which I
want to sort on a particular (numeric) field, and I show 20 results per
page. So if the user clicks on page 5, I'd like to show results 81-100
directly.
Also, how does ES sort on numeric fields? I am assuming it puts
records into buckets of different ranges. (I am completely new to
indexing/search.) Also, is fetching the top 20 results of a set faster than
fetching the top 1000, or fetching results 4001-4010, for example?
From the HTTP API, the size and from parameters should do the trick for pagination.
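As a rough sketch of the from/size arithmetic (not taken from the original post): the helper below builds a search request body for a 1-based page number. The function name `page_request` and the field name `price` are made up for illustration; only `from`, `size`, `sort` and `query` are keys of the search API.

```python
def page_request(page, per_page, sort_field, order="asc"):
    """Build a search body for a 1-based page of sorted results."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    return {
        "from": (page - 1) * per_page,  # an offset (position), not a document ID
        "size": per_page,
        "sort": [{sort_field: {"order": order}}],
        "query": {"match_all": {}},
    }


# Page 5 with 20 results per page: skip 80 hits, return the next 20.
body = page_request(page=5, per_page=20, sort_field="price")
print(body["from"], body["size"])  # 80 20, i.e. results 81-100
```

The body would be sent as the JSON payload of a normal search request; the sort is applied first, and from/size then select a window of that sorted list.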
Sort fields are pulled into memory, which makes the sort operations
quite fast. I'm not sure of the exact method used, though. Also, my
experience with pagination is that it is quite fast, and I haven't
noticed any performance degradation, even when paginating beyond the
50,000th result.
But how would from+size work with sorting? The ids would not be sorted once
we sort based on another column, right? The way we are thinking right now is,
for example, to go to the second page, to check the 20th value of the sorted
field, and use this value in a "from" range query for the next request. But
it would be cumbersome to do this correctly when the sorted field is not
necessarily unique, and even more so when there are multiple sort fields.
On Fri, 2011-06-24 at 10:51 +0530, Hari Shankar wrote:
> But how would from+size work with sorting? The ids would not be sorted
> once we sort based on another column right? The way we are thinking
> right now is, for example, to go to the second page, to check the 20th
> value of the sorted field, and use this value in a "from" range query
> for the next request. But it would be cumbersome to do this correctly
> when the sorted field is not necessarily unique. Also, handling it
> when there are multiple sort fields will be cumbersome.
Just to be clear, the from field takes a position in the sorted result
list, not an ID, so from=20 with size=20 returns results 21-40 regardless
of their IDs.
Sort order is preserved even when the sort value is not unique.
However, the number of docs that need to be processed in order to return
(e.g.) page 50 is 500 * no_of_shards = 2500 (assuming 10 results per page
and 5 primary shards), because every shard has to return its own top 500.
So you really don't want to offer to return page 5 million.
Do as Google does and max out at 1,000 results. Who WANTS to see page 5
million anyway?
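To make the arithmetic above concrete, here is a small sketch. It assumes the simple cost model described in this thread (each shard returns its top from+size entries and the coordinating node merges them); the function names and the `MAX_RESULTS` constant are illustrative, not part of any ES API.

```python
MAX_RESULTS = 1000  # cap suggested in the thread; tune it for your UI


def docs_to_rank(page, per_page, shards):
    """Entries the coordinating node must merge to serve a 1-based page."""
    per_shard = page * per_page  # equals from + size for that page
    return per_shard * shards


def clamp_page(page, per_page, max_results=MAX_RESULTS):
    """Clamp a requested page so it never reaches past max_results hits."""
    last_page = max(1, max_results // per_page)
    return min(max(1, page), last_page)


print(docs_to_rank(50, 10, 5))     # 2500, the figure quoted above
print(clamp_page(5_000_000, 20))   # 50, the last page within 1,000 results
```

Clamping in the UI layer keeps the request cheap no matter what page number a user types into the URL.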
If you need to retrieve all 5 million docs that match a query, e.g. to
reindex or export them, then use a scrolled search with search_type=scan.
The results won't be sorted, but it won't kill your ES server either.
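The scroll loop for such an export might look roughly like the sketch below. It deliberately does not talk to a cluster: `start_scan` and `next_batch` are hypothetical callables standing in for whatever HTTP client you use (the first would issue the search with search_type=scan and a scroll timeout, the second would issue the follow-up scroll request). The sketch assumes each response carries `_scroll_id` and `hits.hits`, and that an empty batch means the scroll is finished.

```python
def scan_all(start_scan, next_batch):
    """Yield every hit of a scan-type scrolled search.

    start_scan() -> scroll_id           (hypothetical transport helper)
    next_batch(scroll_id) -> response   (hypothetical transport helper)
    """
    scroll_id = start_scan()
    while True:
        response = next_batch(scroll_id)
        hits = response["hits"]["hits"]
        if not hits:  # an empty batch means the scroll is exhausted
            return
        yield from hits
        scroll_id = response["_scroll_id"]  # always continue with the latest id


def fake_transport(docs, batch_size):
    """Stand-in transport so the sketch runs without a cluster."""
    state = {"pos": 0}

    def start_scan():
        return "scroll-0"

    def next_batch(scroll_id):
        start = state["pos"]
        state["pos"] = start + batch_size
        return {
            "_scroll_id": f"scroll-{state['pos']}",
            "hits": {"hits": docs[start:start + batch_size]},
        }

    return start_scan, next_batch


docs = [{"_id": str(i)} for i in range(7)]
exported = list(scan_all(*fake_transport(docs, batch_size=3)))
print(len(exported))  # 7
```

Because the loop is a generator, the export can be streamed to a file or a bulk indexer without holding all hits in memory.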
Also note that Elasticsearch takes special care to try to optimize "long tail" pagination (though there is a limit, of course). The "query_then_fetch" search type makes sure to fetch only doc IDs from all shards to do the pagination calculation, and only then goes and fetches the relevant docs needed.
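The two-phase idea can be illustrated with a toy, in-memory model. This is only a sketch of the general technique, not how Elasticsearch implements it: shards are plain dicts, doc IDs are assumed unique across shards, and the doc ID is used as a tie-breaker so that equal sort values still come back in a stable order.

```python
import heapq
from itertools import islice


def query_then_fetch(shards, from_, size, sort_key):
    """Simplified two-phase search over in-memory 'shards'.

    shards: list of dicts mapping doc_id -> document.
    """
    # Phase 1 (query): each shard returns only (sort value, id, shard) tuples
    # for its own top from_ + size documents, never the full documents.
    per_shard = []
    for shard_no, docs in enumerate(shards):
        ranked = sorted((doc[sort_key], doc_id, shard_no)
                        for doc_id, doc in docs.items())
        per_shard.append(ranked[:from_ + size])

    # The coordinator merges the sorted lists and keeps the requested window.
    merged = heapq.merge(*per_shard)
    window = list(islice(merged, from_, from_ + size))

    # Phase 2 (fetch): load full documents for the winners only.
    return [shards[shard_no][doc_id] for _, doc_id, shard_no in window]


shards = [
    {"a": {"id": "a", "price": 5}, "b": {"id": "b", "price": 1}},
    {"c": {"id": "c", "price": 3}, "d": {"id": "d", "price": 3}},
    {"e": {"id": "e", "price": 2}},
]
page = query_then_fetch(shards, from_=1, size=3, sort_key="price")
print([d["id"] for d in page])  # ['e', 'c', 'd']
```

Even in this toy version the cost of deep pages is visible: phase 1 still ranks from_ + size entries per shard, while phase 2 only ever touches size documents.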