We have a use case in our projects where Elasticsearch is used for search, and we need to send anywhere from 50,000-100,000 documents for further processing before sending the results to the client.
The response needs to contain only a single field per document (the document ID) from Elasticsearch.
With an HTTP REST client we were able to transfer them in 10+ seconds. Is there any way to transfer at least 50,000 documents in under 1 second?
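For reference, this is roughly the shape of the request we have been testing with (a minimal sketch; the endpoint and index name below are placeholders):

```python
import requests  # pip install requests

ES = "http://localhost:9200"  # placeholder endpoint
INDEX = "my-index"            # placeholder index name

# Plain search: disable _source so each hit carries only metadata,
# including _id. A single request is capped by index.max_result_window
# (10,000 by default), so larger pulls need that raised or a scroll.
resp = requests.post(
    f"{ES}/{INDEX}/_search",
    json={
        "size": 10000,
        "_source": False,           # _id is always returned in hit metadata
        "query": {"match_all": {}}  # stand-in for our real selection query
    },
).json()

ids = [hit["_id"] for hit in resp["hits"]["hits"]]
```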
Ingest performance will depend on the size and complexity of the documents, the mappings used, the bulk size, and the size and specification of the cluster and its hardware. I have seen clusters able to index several hundred thousand documents per second, so it is possible. If you have very high peaks in throughput, you may need a larger cluster than expected.
If you tell us a bit more about the data we can probably give a better answer.
@Christian_Dahlqvist Thanks for the info.
The problem I am describing is not with ingesting records but with search.
The search query selects on average 50,000-80,000 records, and I would like to
know how fast I could transfer the whole result set for further processing.
In our case the search response took 10+ seconds. Each document contains at most
20 JSON key-value pairs and is ~2 KB. The hardware is 2 AWS c4.8xlarge instances,
and the index holds a maximum of 5 million records.
I remember writing that answer for a different thread on my mobile, so I am sorry it somehow ended up on the wrong thread without me noticing.
Retrieving large volumes of documents is not necessarily what Elasticsearch was designed and optimized for, and it results in a lot of random disk reads if your indices are not fully cached by the operating system. If the data set is not fully cached, I would expect response time to be limited more by disk speed than by CPU.
In order to optimize this it would help if you could answer the following questions:
How large is your total data set?
How many indices/shards/replicas do you currently have in place? How much space on disk do these occupy?
If you run your queries and return only a small number of documents, e.g. 1000, how long do they take to execute?
How many concurrent queries of the type you mentioned do you need to support?
When retrieving large amounts of data it is recommended to use a scroll query (there is a sketch after these questions). What settings are you currently using for yours?
Given that you are using c4 instances I assume you may be using EBS. If so, what is the size and specification of your EBS volumes?
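For reference, a minimal scroll loop looks roughly like this (a sketch only; the endpoint, index name, and page size below are placeholders you would adapt):

```python
import requests  # pip install requests

ES = "http://localhost:9200"  # placeholder endpoint
INDEX = "my-index"            # placeholder index name

# Open a scroll context, kept alive for 1 minute between pages.
# _source: false means each hit carries only metadata, including _id.
resp = requests.post(
    f"{ES}/{INDEX}/_search",
    params={"scroll": "1m"},
    json={"size": 10000, "_source": False, "query": {"match_all": {}}},
).json()

ids = [hit["_id"] for hit in resp["hits"]["hits"]]

# Page through the remaining results until a page comes back empty.
while resp["hits"]["hits"]:
    resp = requests.post(
        f"{ES}/_search/scroll",
        json={"scroll": "1m", "scroll_id": resp["_scroll_id"]},
    ).json()
    ids.extend(hit["_id"] for hit in resp["hits"]["hits"])

# Release the scroll context so it does not hold resources until it expires.
requests.delete(f"{ES}/_search/scroll", json={"scroll_id": resp["_scroll_id"]})

print(f"retrieved {len(ids)} document IDs")
```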
Your statement, "Retrieving large volumes of documents is not necessarily what Elasticsearch was designed and optimized for," already answers my question.
But anyway, here is my rough setup:
Dataset is small: ~5 million documents, ~2 KB each.
Total index size: ~20 GB.
2 shards / 1 replica / 5 c4.8xlarge nodes with 50 GB EBS volumes.
Fetching 1,000 documents takes around 2-3 seconds.
Right now I am not using a scroll query, just testing with the "size" param in Kibana. But would a scroll query bring retrieval of 50-100k documents down to under 2 seconds?
Given the size of those boxes, I assume the full data set is likely to be cached, meaning disk I/O should not be a factor.
If retrieving relatively few documents takes that long even though the data is cached, I do not see how you can make retrieval of more documents any faster without optimizing the query, the structure of the documents, and/or the mappings. Experimenting with the number of primary shards may also help, but I am not sure how much difference it will make.
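If you want to experiment with the primary shard count, the usual approach is to create a new index with the desired number of shards and reindex into it; a sketch (index names and shard count below are placeholders):

```python
import requests  # pip install requests

ES = "http://localhost:9200"  # placeholder endpoint

# Create a new index with a different primary shard count
# (shards cannot be changed on an existing index).
requests.put(
    f"{ES}/my-index-4shards",  # placeholder target index name
    json={"settings": {"index": {"number_of_shards": 4, "number_of_replicas": 1}}},
)

# Copy the documents across with the _reindex API.
requests.post(
    f"{ES}/_reindex",
    json={"source": {"index": "my-index"}, "dest": {"index": "my-index-4shards"}},
)
```

Then run the same query against both indices and compare response times.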