Scrolling is not intended for real-time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
What is the recommended solution for real-time user requests which return over 10,000 items from ES?
The use case: a user sends a custom request to Elasticsearch using the Java API. Over 10,000 records are returned as part of the search response. My current implementation using scrolling takes too long: the query itself runs quickly, but the process of iterating over the scroll responses is too slow.
I'm sure I can continue to optimize my code, but is there any way around using a scroll response to return over 10,000 documents?
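For context, the scroll flow I'm using follows the standard two-step REST pattern; a rough sketch (the index name `my-index` is a placeholder for ours):

```json
POST /my-index/_search?scroll=1m
{
  "size": 1000,
  "query": {
    "term": { "foo": "bar" }
  }
}
```

and then, repeatedly, until no more hits come back:

```json
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id from the previous response>"
}
```

Each follow-up request is a separate round trip, which is where my iteration time seems to go.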
And coming back to the scroll documentation: what is the best practice for returning over 10,000 documents in real time if scrolling is not intended for real-time requests?
The Changes API wouldn't be relevant here, but I hadn't heard of it before. I see a GitHub thread about it but no documentation; is there another link you can send me?
A user sends a request for data they need in real time, e.g. a user needs every single document in our index that matches termQuery("foo", "bar"). About 3 million documents match, and every single one of those documents needs to be returned in real time (or near real time).
As far as I remember, Elasticsearch caps the number of documents a search response can return at 10,000. To go over that limit you need to use scrolling. Scrolling is not intended for real-time user requests (which is clear to me now after seeing how slow it is), but I HAVE to use scrolling if I want to return all 3 million documents.
EDIT: Yes, I have thought about using several "from/size" queries, but that felt a little hacky. Has the team at ES dealt with the use case where over 10,000 documents need to be returned in real time? Our indexing process can be slow; querying is more important.
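For reference, the "from/size" approach pages through results by shifting the `from` offset on each request; a minimal sketch of one page, assuming a hypothetical index name `my-index`:

```json
GET /my-index/_search
{
  "from": 10000,
  "size": 1000,
  "query": {
    "term": { "foo": "bar" }
  }
}
```

Note that `from + size` cannot exceed `index.max_result_window` (10,000 by default), and deep offsets get progressively more expensive because each page has to collect and then discard all of the preceding hits.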
It's relevant in our use case. I am reading now that we can set index.max_result_window to a value greater than the default of 10,000, which may be the easiest way to handle this use case.
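For anyone else reading, `index.max_result_window` is a dynamic index setting, so it can be raised without reindexing; a sketch, again assuming a hypothetical index name `my-index`:

```json
PUT /my-index/_settings
{
  "index": {
    "max_result_window": 3000000
  }
}
```

The trade-off is that the limit exists to protect the cluster: large windows raise the memory and CPU cost of each deep search, so this moves the problem rather than removing it.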
Are there any articles or blog posts you can point me to on real-time search optimizations for Elasticsearch?
Using scroll is the best and most efficient way to get large amounts of data out of the cluster. What is the specification of your cluster? What type of storage do you have? What load is the cluster under when you run this? Are you monitoring disk metrics and performance?
To elaborate on this, it's highly unlikely all 3 million results are ever going to be useful. Even Google won't return every single possible result available, because people rarely look beyond a few pages of results.
So if we can understand why this is important then it'll really help with any advice.
What does a user do with 3 million documents? What is the use case?
In our use case, the results are consumed by a system that performs calculations which are immediately important; the user does nothing with them directly. Oftentimes, even in the Engineer training, the use case I've seen for Elasticsearch is a shopping page or an interface in which a user (i.e. a human) immediately looks at or uses the documents returned. I think that may be why the idea of needing 3 million documents in real time seems excessive, but I can promise @warkolm that every document is important. Does that help?
What is the specification of your cluster?
The entire cluster is running on default settings. We are not monitoring disk metrics, performance, or load. What do you mean by "What type of storage do you have?"
What is the specification of the hosts Elasticsearch is deployed on? Are you using local SSDs, spinning disks or maybe SAN? How many nodes do you have in the cluster? How much data?
OS: Windows 10 Enterprise
RAM: 32GB
Processor: Intel(R) Xeon(R) CPU E3-1535M v5 @2.90GHz
System type: 64-bit operating system, x64-based processor
Hard Drive: Samsung SSD PM810 2.5" 256GB
Data: 25 MB to 300 MB (varies based on the specific solution, but I'll say 25 MB if you need one number)
Node count: 1
Document count: 85k to 10 million (varies based on the specific solution, but I'll say 85k if you need one number)
I'm giving you the lower bounds to see if this system can be optimized on those. If so, I'll look at how it performs under a larger load.
Are we trying to determine if this is a resource issue? When I find a way to monitor disk performance and CPU usage on Windows, what sort of metrics are you looking for?
I am wondering if it is a resource issue as you only have one node. It would be interesting to see what the limit is at the moment. Even though you will be able to tune it and scale up/out, retrieving lots of documents from Elasticsearch will probably require a lot of random disk reads and decompression of the source, so I am not sure how fast you will be able to get it.