Is it good practice to manually merge-sort the search results per shard?

richardxin · November 21, 2017, 1:01am

We are having an internal debate about whether it's good practice to manually merge search results per shard.
Our main index has 135 shards with a single replica on 90 data nodes. The biggest query we run is a "Top N" out of M, where N is 3 million and M is 12 million. We found that ES is very slow on some of these queries.
One of our engineers suggested we do a manual merge sort, i.e. one slice call per shard, followed by our own merge-sort logic.
I am not convinced that's the right direction; it feels like we are reinventing the wheel. Is there anything I missed that we could gain from a manual merge? Is anyone doing something similar? Thanks!

dadoonet · November 21, 2017, 3:21am

it feels like we are reinventing the wheel

I agree with that impression.

It would be better to first try to fix the query if possible.
Maybe share it here?

richardxin · November 21, 2017, 5:50pm

@dadoonet thanks for your reply.
The key problem we are facing is deep pagination: we sort and need to persist up to 3 million records somewhere, in one or two blobs. Is there any best practice for accomplishing this?

dadoonet · November 21, 2017, 9:34pm

the key problem we are facing is deep pagination

Don't do it, or use Search After (Elasticsearch Reference [6.0]) or the scroll API.

we sort and need to persist up to 3 million records somewhere, in one or two blobs

What do you mean by "blobs"?

richardxin · November 21, 2017, 10:16pm

We do use the scroll API, and we just store the 3 million sorted results {id, score} in 2 separate files (sorted ids and all scores).

dadoonet · November 22, 2017, 2:59am

So what is the problem? I’m confused by the initial question I guess.

richardxin · November 22, 2017, 3:45am

So two possible solutions for top N are (N varies and could be as much as 3 million):

1) Assuming we want to keep index.max_result_window at its default value of 10000 to avoid blowing up the heap, we could get the 3 million sorted results by calling the scroll API in a loop and appending each page to the previous ones.

OR

2) Create an application that splits the top N call into 135 concurrent calls (we have 135 shards, so one per shard). This gives us 135 pre-sorted lists, which we merge until the target N is reached. During the merge, if any list is exhausted, we need a mechanism to fetch the next page for that shard.

Any suggestions on which solution is better, 1 or 2? Or are there any better solutions? Thanks
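
For reference, the merge step in option 2 can be sketched in a few lines of Python. This is only an illustration under assumptions, not something taken from Elasticsearch: `fetch_page` is a hypothetical callable standing in for "get the next page of one shard's results", each shard's hits are assumed to arrive already sorted by score in descending order, and a hit is reduced to an (id, score) pair.

```python
import heapq
from itertools import islice
from typing import Callable, Iterator, List, Tuple

Hit = Tuple[str, float]  # (document id, score)
PageFetcher = Callable[[int], List[Hit]]  # hypothetical: page number -> hits


def shard_hits(fetch_page: PageFetcher) -> Iterator[Hit]:
    """Lazily yield one shard's hits, fetching the next page only when needed."""
    page_no = 0
    while True:
        page = fetch_page(page_no)
        if not page:  # an empty page means the shard is exhausted
            return
        yield from page
        page_no += 1


def top_n(shards: List[PageFetcher], n: int) -> List[Hit]:
    """Merge per-shard streams (each sorted by score, descending), keep the first n."""
    streams = [shard_hits(fetch) for fetch in shards]
    # heapq.merge only holds one pending hit per stream in memory
    merged = heapq.merge(*streams, key=lambda hit: hit[1], reverse=True)
    return list(islice(merged, n))
```

Because the streams are generators, a shard's next page is requested only when the merge has consumed its previous one, which is the "fetch the next page when a list is exhausted" mechanism described in option 2.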

dadoonet · November 22, 2017, 6:13am

Why do you need to extract 3m documents?

Are you still indexing while doing that?

richardxin · November 22, 2017, 4:36pm

That's one of the requirements of a legacy application; this requirement is not negotiable.

Let's assume "No" for now.

dadoonet · November 22, 2017, 4:58pm

this requirement is not negotiable

Well. Ok.
In former companies, I found that I had to understand the real need users were trying to express. Like:

User: "I want to export data to Excel".
Me: "Well. That's not a business need. That's the way you think you should solve your problem, but what if I come up with something even more efficient that actually suits your needs?"

So, in that case, scrolling multiple parts of the data in parallel makes sense to me: for example, instead of extracting a year of data with one scroll call, extract the 12 months in parallel.
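
As a rough sketch of that split, here is one way to build the 12 per-month request bodies in Python. The field name `timestamp` is an assumption (the thread never names the date field); each body would then be run as its own scroll, for example from a thread pool.

```python
from datetime import date
from typing import Dict, List


def monthly_range_queries(year: int, field: str = "timestamp") -> List[Dict]:
    """Build one range-query body per calendar month of the given year.

    `field` is a hypothetical date field name; replace it with the real one.
    """
    bodies = []
    for month in range(1, 13):
        start = date(year, month, 1)
        # the exclusive upper bound is the first day of the following month
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        bodies.append({
            "query": {
                "range": {
                    field: {"gte": start.isoformat(), "lt": end.isoformat()}
                }
            }
        })
    return bodies
```

Using a half-open interval (`gte` start, `lt` next month) keeps the 12 parts from overlapping, so no document is extracted twice.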

Scroll can be improved though if you sort by _doc, like:

GET /_search?scroll=1m
{
  "sort": [
    "_doc"
  ]
}
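
For completeness, a minimal sketch of the loop that drains such a scroll from Python. It assumes `client` behaves like the official elasticsearch-py client (`search`, `scroll` and `clear_scroll` methods taking these keyword arguments); the index name, page size and keep-alive values are placeholders.

```python
from typing import Any, Dict, Iterator


def scroll_hits(client: Any, index: str, body: Dict,
                page_size: int = 1000, keep_alive: str = "1m") -> Iterator[Dict]:
    """Yield every hit of a scroll search, one page at a time.

    `client` is assumed to expose search/scroll/clear_scroll like elasticsearch-py.
    """
    response = client.search(index=index, body=body,
                             scroll=keep_alive, size=page_size)
    scroll_id = response["_scroll_id"]
    try:
        # an empty page signals that the scroll is exhausted
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = response["_scroll_id"]
    finally:
        # release the search context on the cluster as soon as we are done
        client.clear_scroll(scroll_id=scroll_id)
```

A call such as `scroll_hits(client, "my-index", {"sort": ["_doc"]})` would then stream the whole result set without raising index.max_result_window.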

system · December 20, 2017, 4:58pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
