Which is better to use?

Hello everyone,
I am using Elasticsearch 5.4.
I am currently working on a feature that requires me to sort distinct values by how many times each one is repeated across the documents. The data could reach 200 thousand records, so pagination is needed as well.

The first scenario I tried was aggregations with sorting, which worked well for me, but it lacked the pagination part:
{
  "size": 0,
  "aggs": {
    "Most Frequent Visits": {
      "terms": {
        "field": "output.keyword",
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}

The next scenario I tried was collapse, with from and size for pagination. For the sort, I wanted to order by the total number that comes back in the hits part of the response, but I could not find a way to sort by it:

{
  "from": 0,
  "size": 5,
  "collapse": {
    "field": "output.keyword",
    "inner_hits": {
      "name": "latest_record",
      "size": 1,
      "sort": [
        {
          "timestamp": "desc"
        }
      ]
    }
  },
  "sort": [
    {
      "timestamp": {
        "order": "asc"
      }
    }
  ]
}

Does anyone have any ideas on what to use in such a situation?
Thanks

First question: why do you need 200,000 paginated results? No human is going to look through 200k results.


Clinton_Gormley, I am talking about the worst-case scenario. If the number is what bothers you, let's say there are 1000 records and I want to paginate them. How could I do that in Elasticsearch? In other words, is there any way for me to do so using either of the above scenarios?

The number is important though as it changes the scale. By saying 200,000, you're indicating that the output is for consumption by machine, not humans, in which case you should probably be using a scrolled search instead.
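For the machine-consumption case, a scrolled search could look roughly like the sketch below. It only shows the request flow: an initial search with a `scroll` keep-alive, then repeated calls to `/_search/scroll` with the returned `_scroll_id` until a batch comes back empty. The `send` callable is a hypothetical transport (method, path, body in, parsed JSON out) that you would wire to your own HTTP client, and the function name, batch size, and keep-alive value are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Iterator

# Hypothetical transport: send(method, path, body) -> parsed JSON response.
Send = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def scroll_all(send: Send, index: str, query: Dict[str, Any],
               batch_size: int = 1000,
               keep_alive: str = "1m") -> Iterator[Dict[str, Any]]:
    """Yield every hit matching `query` by following the scroll cursor."""
    # Sorting by _doc is the cheapest order when ranking does not matter.
    body = {"size": batch_size, "query": query, "sort": ["_doc"]}
    resp = send("POST", f"/{index}/_search?scroll={keep_alive}", body)
    # An empty batch of hits means the scroll is exhausted.
    while resp["hits"]["hits"]:
        yield from resp["hits"]["hits"]
        resp = send("POST", "/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": resp["_scroll_id"]})
```

Each batch is fetched only when the consumer asks for more hits, so the whole result never has to sit in memory at once.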

For human consumption, you don't want to build a UI that allows you to paginate to the last result (which COULD be 200k long) because a bot will follow all of those links and bring your Elasticsearch cluster to its knees (well, it would have done before we added a limit of 10k results for searches without scroll).

So given that this is for human consumption, my three suggestions stand:

  • either use a cardinality agg on the first page to figure out how many unique values there are for the field you're collapsing on, to determine how many pages there are
  • use continuous paging so that users can't click on page 42
  • or limit the result set to (e.g.) a max of 1000, and if users click on page 100, tell them that duplicate results have been removed (which is Google's strategy)
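The first and third suggestions can be combined. Below is a minimal sketch that builds the collapse request for a given page, attaches a cardinality agg only on page 1, and derives a capped page count from the result. The helper names, the `unique_values` agg name, and the descending `timestamp` sort are assumptions for illustration; the field names come from the queries earlier in the thread. Note that the cardinality agg returns an approximate count, so the last page may be slightly off.

```python
import math
from typing import Any, Dict


def collapsed_page_body(field: str, page: int, page_size: int) -> Dict[str, Any]:
    """Build a collapse search body for a 1-based page number."""
    body: Dict[str, Any] = {
        "from": (page - 1) * page_size,
        "size": page_size,
        "collapse": {"field": field},
        # Placeholder sort; use whatever ordering the UI needs.
        "sort": [{"timestamp": {"order": "desc"}}],
    }
    if page == 1:
        # hits.total counts documents, not collapsed groups, so ask for
        # the number of distinct values once, on the first page.
        body["aggs"] = {"unique_values": {"cardinality": {"field": field}}}
    return body


def page_count(unique_values: int, page_size: int, max_results: int = 1000) -> int:
    """Number of pages to offer, capped at `max_results` results overall."""
    return math.ceil(min(unique_values, max_results) / page_size)
```

On the first response, the distinct count would be read from `aggregations.unique_values.value` and passed to `page_count`; later pages reuse that number instead of repeating the agg.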
