Search 1M records in Elasticsearch using pagination

I have loaded around 1 TB of data into Elasticsearch.
For searching, I tried the following approaches:

  1. "from + size" - index.max_result_window defaults to 10000, but I wanted to search from offset 100000, so I set index.max_result_window to 100000. I then searched with from=100000 and size=10, but that filled up the heap.
  2. Scroll API - We need to specify a time window for keeping the search context alive, and keeping the older segments alive requires more file handles. So it again consumes the memory configured on the nodes of the cluster.
  3. search_after - I tried sorting documents on _uid, but it gives me the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "circuit_breaking_exception",
        "reason": "[fielddata] Data too large, data for [_uid] would be [13960098635/13gb], which is larger than the limit of [12027297792/11.2gb]",
        "bytes_wanted": 13960098635,
        "bytes_limit": 12027297792
      }
    ]
  }
},

What can be done to resolve this error, and what is the most efficient way to search a large chunk of data (i.e. 100000 results or more) through pagination?


The Scroll API is the proper way to do deep pagination. The problem with search_after is that it's stateless: it returns results from the index as it exists at the time of each request. That means ongoing updates, deletes, and new documents will show up in the next pagination request and can mess up the order, duplicate results, etc.
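To make the "stateless" point concrete, here is a minimal sketch of a search_after loop. The `search` callable is a hypothetical stand-in for whatever client sends the request body to `_search`, and the sort field is an assumption: it should be a unique field backed by doc values, since sorting on `_uid` is what loaded fielddata onto the heap in the error above.

```python
from typing import Any, Callable, Dict, Iterator, List

# Hypothetical transport: takes a _search request body, returns the parsed JSON response.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def search_after_pages(
    search: SearchFn,
    sort: List[Dict[str, str]],
    page_size: int = 1000,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield pages of hits, using the last hit's sort values as the cursor."""
    body: Dict[str, Any] = {
        "size": page_size,
        "query": {"match_all": {}},
        "sort": sort,  # assumed to be a unique, doc-values-backed field
    }
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield hits
        # The only "state" is carried by the client: the sort values of the
        # last hit. Each request is evaluated against the index as it is now.
        body = {**body, "search_after": hits[-1]["sort"]}
```

Nothing is held open on the cluster between requests, which is why documents indexed or deleted between two pages can shift what the next page returns.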

Scrolling is the tool for deep pagination. Keeping search contexts alive is not necessarily memory-hungry; it simply tells ES which older segments must not be deleted after a merge while the context is open. And presumably this deep pagination process is a "background job", not something that hundreds of users are accessing simultaneously.

It does come with some overhead, but nothing is free and that's the cost of scrolling :)
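For reference, a rough sketch of a scroll loop against the REST endpoints (`POST /<index>/_search?scroll=...`, `POST /_search/scroll`, `DELETE /_search/scroll`). The `request` callable is a hypothetical transport, not part of any client library; the index name, page size, and keep-alive value are placeholders.

```python
from typing import Any, Callable, Dict, Iterator, List

# Hypothetical transport: (HTTP method, path, JSON body) -> parsed JSON response.
RequestFn = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def scroll_pages(
    request: RequestFn,
    index: str,
    page_size: int = 1000,
    keep_alive: str = "1m",
) -> Iterator[List[Dict[str, Any]]]:
    """Yield pages of hits from a scroll, clearing the context when done."""
    body = {
        "size": page_size,
        "query": {"match_all": {}},
        "sort": ["_doc"],  # index order, the cheapest sort for scrolling
    }
    resp = request("POST", f"/{index}/_search?scroll={keep_alive}", body)
    scroll_id = resp.get("_scroll_id")
    try:
        while resp["hits"]["hits"]:
            yield resp["hits"]["hits"]
            # Each call renews the keep-alive and returns the next page.
            resp = request(
                "POST",
                "/_search/scroll",
                {"scroll": keep_alive, "scroll_id": scroll_id},
            )
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Release the search context instead of waiting for it to time out.
        if scroll_id is not None:
            request("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

The keep-alive only needs to cover the time it takes to process one page, and clearing the scroll at the end frees the held segments right away, which keeps the overhead small for a background job.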

Thanks, Zachary! :)
