Real time scrolling solution

TameemSamawi · July 16, 2018, 3:23pm

The documentation on scrolling explicitly states:

Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.

What is the recommended solution for real time user requests which return over 10,000 items from ES?

dadoonet · July 16, 2018, 7:52pm

What do you mean? I mean what is the use case?

Does a user want to see more than 10000 records as fast as possible?

Are you looking for something like a Changes API?

TameemSamawi · July 16, 2018, 8:06pm

The use case: A user sends a custom request to Elasticsearch using the Java API. Over 10,000 records are returned as part of the search response. My current implementation using scrolling takes too long: the query itself runs quickly, but iterating over the scroll response is too slow.

I'm sure I can continue to optimize my code, but is there any way around using a scroll response to return over 10,000 documents?

And coming back to the scroll documentation: what is the best practice for returning over 10,000 documents in real time if scrolling is not intended for real-time requests?

The Changes API wouldn't be relevant here, but I haven't heard of that before. I see a GitHub thread about it but no documentation. Is there another link you can send me?

dadoonet · July 16, 2018, 8:18pm

Why do you want to get back more than 10000 documents? Sorry I did not get that from your answer.

TameemSamawi · July 16, 2018, 8:43pm

No worries!

A user sends a request for data they need in real time. For example, a user needs every single document in our index which matches the termQuery("foo", "bar"). About 3 million documents match that, and every single one of those documents needs to be returned in real time (or near real time).

Last I remember Elasticsearch caps the number of documents returnable by a searchResponse to 10,000. To go over that limit you need to use scrolling. Scrolling is not intended for real-time user requests (which is clear to me now after seeing how slow it is), but I HAVE to use scrolling if I want to return all 3 million documents.

EDIT: Yes, I have thought about using several "from/size" queries, but that felt a little hacky. Has the team at ES dealt with the use case where over 10,000 documents need to be returned in real time? Our indexing process can be slow; querying is more important.
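For reference, the scroll loop described above can be sketched against the REST scroll endpoints rather than the Java API. This is a minimal sketch under stated assumptions, not the poster's implementation: `send` is a hypothetical transport callable that takes a method, a path and a JSON body and returns the decoded response, and the page size of 5000 is only an illustrative value. Larger pages mean fewer round trips, and sorting by `_doc` lets Elasticsearch skip scoring-based ordering.

```python
import json
from typing import Callable, Dict, Iterator

# Hypothetical transport: takes (method, path, body_json) and returns the
# decoded JSON response as a dict. In a real program this would wrap an
# HTTP client pointed at the cluster.
Send = Callable[[str, str, str], Dict]


def scroll_all(send: Send, index: str, query: Dict,
               page_size: int = 5000, keep_alive: str = "1m") -> Iterator[Dict]:
    """Yield every hit matching `query`, fetching one scroll page at a time."""
    body = {
        "size": page_size,   # maximum hits returned per round trip
        "sort": ["_doc"],    # index order, the cheapest sort for scrolling
        "query": query,
    }
    page = send("POST", f"/{index}/_search?scroll={keep_alive}", json.dumps(body))
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            yield from page["hits"]["hits"]
            # Always continue with the most recent scroll id.
            page = send("POST", "/_search/scroll",
                        json.dumps({"scroll": keep_alive, "scroll_id": scroll_id}))
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the search context on the server once iteration stops.
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", json.dumps({"scroll_id": scroll_id}))
```

A caller would pass something like `{"term": {"foo": "bar"}}` as `query` and consume the generator; the `finally` clause clears the scroll context even if the consumer stops early.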

dadoonet · July 16, 2018, 8:55pm

What is unclear to me is what the user will do with 3 million records on their side.

TameemSamawi · July 16, 2018, 9:00pm

It's relevant in our use case. I am reading now that we can set index.max_result_window to a value greater than the default of 10,000, which may be the easiest way to handle this use case.
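As a rough illustration of what changing that setting involves, the sketch below only builds the settings-update request; it does not send anything. The index name `my-index` and the value 50000 are placeholders, not values from this thread.

```python
import json
from typing import Tuple


def max_result_window_request(index: str, window: int) -> Tuple[str, str, str]:
    """Build the (method, path, body) of a settings update raising the paging cap."""
    if window < 1:
        raise ValueError("window must be positive")
    # index.max_result_window limits from + size for searches on this index.
    body = {"index": {"max_result_window": window}}
    return "PUT", f"/{index}/_settings", json.dumps(body)


# Placeholder index name and window size, for illustration only.
method, path, body = max_result_window_request("my-index", 50000)
print(method, path, body)
```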

Are there any articles or blog posts you can point me to on real-time search optimizations for Elasticsearch?

dadoonet · July 16, 2018, 9:19pm

"It's relevant in our use case"

That's what I want to understand. What is the use case? What does a user do with 3 million records?

Increasing the default value will slow down your search and will put a lot of memory pressure on your nodes.
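The memory pressure comes from how from/size paging works: each shard has to collect its own top `from + size` hits, and the coordinating node then merges the lists from every shard. A back-of-the-envelope sketch, assuming 5 shards purely for illustration (the actual shard count of the index in question is not stated here):

```python
def deep_page_cost(from_: int, size: int, shards: int) -> dict:
    """Rough count of sorted entries held for a single from/size request."""
    per_shard = from_ + size             # each shard collects its top from+size
    at_coordinator = per_shard * shards  # every shard's list is merged on one node
    return {"per_shard": per_shard,
            "at_coordinator": at_coordinator,
            "returned": size}


# First page of 10 hits versus the last 1,000 of 3 million hits.
print(deep_page_cost(from_=0, size=10, shards=5))
print(deep_page_cost(from_=2_999_000, size=1_000, shards=5))
```

Under these assumptions the last page makes the cluster sort through 15 million entries in order to return 1,000 of them, which is why a large result window gets expensive quickly.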

Christian_Dahlqvist · July 16, 2018, 9:38pm

Using scroll is the best and most efficient way to get large amounts of data out of the cluster. What is the specification of your cluster? What type of storage do you have? What load is the cluster under when you run this? Are you monitoring disk metrics and performance?

warkolm · July 16, 2018, 9:51pm

To elaborate on this, it's highly unlikely all 3 million results are ever going to be useful. Even Google won't return every single possible result available because people rarely look beyond a few pages of results.

So if we can understand why this is important then it'll really help with any advice.

TameemSamawi · July 17, 2018, 2:28pm

Thank you all for your answers.

"What does a user do with 3 million documents? What is the use case?"
In our use case, the results will be used by a system to perform calculations that are immediately important. The user does nothing with them; a system uses them. Often, and even in the Engineer training, the use case I've seen for Elasticsearch is a shopping page or an interface in which a user (a human) immediately looks at or uses the documents returned. I think that may be why the idea of needing 3 million documents in real time seems excessive, but I can promise @warkolm that every document is important. Does that help?

"What is the specification of your cluster?"
The entire cluster is running on default settings. We are not monitoring disk metrics, performance or load. What do you mean by "What storage do you have?"

Christian_Dahlqvist · July 17, 2018, 2:32pm

What is the specification of the hosts Elasticsearch is deployed on? Are you using local SSDs, spinning disks or maybe SAN? How many nodes do you have in the cluster? How much data?

TameemSamawi · July 17, 2018, 3:04pm

Specifications:

OS: Windows 10 Enterprise
RAM: 32GB
Processor: Intel(R) Xeon(R) CPU E3-1535M v5 @2.90GHz
System type: 64-bit operating system, x64-based processor
Hard Drive: Samsung SSD PM810 2.5" 256GB

Data: 25 MB - 300 MB (varies based on specific solution, but I'll say 25 MB if you need one number)
Node count: 1
Document count: 85k - 10 million (varies based on specific solution, but I'll say 85k if you need one number)

I'm giving you the lower bounds to see if this system can be optimized on those. If so, I'll look at how it performs under a larger load.

Christian_Dahlqvist · July 17, 2018, 3:19pm

Have you monitored disk performance and CPU usage while running a reasonably large scroll query?

TameemSamawi · July 17, 2018, 3:28pm

Not yet. Are we talking about using the Nodes stats API to do so?

Christian_Dahlqvist · July 17, 2018, 3:30pm

No, I don't think that captures disk performance. I'm not sure how best to get that information on Windows, as I am not a Windows user.

TameemSamawi · July 17, 2018, 3:34pm

Are we trying to determine if this is a resource issue? When I find a way to monitor disk performance and CPU usage on Windows, what sort of metrics are you looking for?

Christian_Dahlqvist · July 17, 2018, 3:39pm

I am wondering if it is a resource issue as you only have one node. It would be interesting to see what the limit is at the moment. Even though you will be able to tune it and scale up/out, retrieving lots of documents from Elasticsearch will probably require a lot of random disk reads and decompression of the source, so I am not sure how fast you will be able to get it.
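One way to see where the current limit sits, independent of OS-level disk metrics, is to time each scroll round trip on the client side. The sketch below is only an assumption about how that could be done: `pages` is a hypothetical iterable that yields one list of hits per round trip, and the injectable `clock` exists so the function can be exercised without a cluster.

```python
import time
from typing import Callable, Dict, Iterable, List


def measure_pages(pages: Iterable[List[Dict]],
                  clock: Callable[[], float] = time.perf_counter) -> List[Dict]:
    """Record how long each page took to arrive and how many hits it held."""
    stats = []
    started = clock()
    for hits in pages:
        now = clock()
        # Time spent waiting for this page, excluding any later processing.
        stats.append({"hits": len(hits), "seconds": now - started})
        started = now
    return stats


def throughput(stats: List[Dict]) -> float:
    """Overall hits per second across all measured pages."""
    total_seconds = sum(s["seconds"] for s in stats)
    total_hits = sum(s["hits"] for s in stats)
    return total_hits / total_seconds if total_seconds > 0 else 0.0
```

Comparing the per-page timings at different page sizes shows whether the time goes into a few slow fetches or is spread evenly, which helps separate a resource limit on the node from overhead in the client loop.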

TameemSamawi · July 17, 2018, 3:47pm

Gotcha, this use case may be a limitation of Elasticsearch. Let me get those numbers to you within the next week.

Christian_Dahlqvist · July 17, 2018, 3:48pm

I have not tuned scroll queries, so there may be others with a better idea of what can and can't be achieved and what the limits are.
