I am using Python as an Elasticsearch client
and I want to fetch only 1 document.
I have noticed that the search speed for fetching one document is roughly the same as fetching a lot of documents.
In Oracle SQL there is a FETCH FIRST clause that saves time in searches.
Is there something like that in Elasticsearch?
I want to limit my results and I want to do it right, so my searches will be fast.
I’m using a simple filter query with two keys.
There are many results for the query,
but I want to fetch only one or a few of the results, not all of them.
I don’t care which documents are fetched,
I just want them to be part of the query results.
How can I limit the number of results efficiently?
Thank you for answering.
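For reference, a minimal sketch of limiting results with the `size` parameter, assuming the official Python client (elasticsearch-py); the index name, field names, and values below are made up for illustration:

```python
# Sketch: cap the number of returned hits with `size`.
# Field names ("key1", "key2") and values are hypothetical.

def limited_search_body(limit=1):
    """Build a two-key filter query that asks for at most `limit` hits."""
    return {
        "size": limit,  # Elasticsearch returns at most this many documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"key1": "value1"}},
                    {"term": {"key2": "value2"}},
                ]
            }
        },
    }

# Usage (requires a running cluster):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   hits = es.search(index="my-index", body=limited_search_body(1))["hits"]["hits"]
```

With `size=1`, each shard still executes the query, but only one document is fetched and returned to the client.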
I’m using the Python client to extract data from Elasticsearch.
I need to retrieve more than 10,000 documents, so I’m using scan/scroll.
There is no parameter called terminate_after in scan/scroll.
I know how to use it, but I couldn’t see any option to use it with scan/scroll.
Please help me.
If I ask for only 2 results, I have to wait a long time.
So if someone limits their results, I need to check whether the limit is less than 10,000; if so, I skip scroll and query the results with the search function and size set to that limit.
And if the results are not limited, I use scroll?
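The branching logic described above can be sketched as follows; the 10,000 figure assumes the default `index.max_result_window` setting:

```python
# Sketch: pick a fetch strategy based on the requested limit.
# 10,000 is the default index.max_result_window; a plain search
# with `size` cannot page past it, but scan/scroll can.

MAX_RESULT_WINDOW = 10_000

def choose_strategy(limit):
    """Return "search" when one sized request is enough, else "scan"."""
    if limit is not None and limit <= MAX_RESULT_WINDOW:
        return "search"  # single request with size=limit
    return "scan"        # helpers.scan paginates past the window
```

`choose_strategy(2)` picks a plain search, while an unlimited query or a limit above 10,000 falls back to scan/scroll.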
OK, I found that scroll can take a terminate_after parameter.
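A sketch of how that might look: `terminate_after` goes inside the query body that is handed to `helpers.scan` in elasticsearch-py. The index name and query are illustrative.

```python
# Sketch: a query body for helpers.scan with per-shard early termination.
# terminate_after stops each shard after that many matching documents.

def scan_body(per_shard_limit):
    """Build a scan/scroll body that terminates each shard early."""
    return {
        "terminate_after": per_shard_limit,
        "query": {"match_all": {}},  # placeholder query
    }

# Usage (requires a running cluster):
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("http://localhost:9200")
#   for hit in helpers.scan(es, index="my-index",
#                           query=scan_body(20_000), size=1000):
#       ...
```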
I have billions of records.
I want to understand what happens behind the scenes.
Suppose I have 3 shards and each shard has 40,000 relevant query results.
If I set terminate_after to 20,000,
then each one of the 3 shards returns 20,000 results
-> all of them together return 3 × 20,000 = 60,000 results to the coordinating node.
Suppose I have 1 master node; it processes the results of each shard and assembles the query results.
Then, scroll takes the query results piece by piece according to the size in scroll (for example: 1000).
terminate_after makes the query execution faster.
When there is no terminate_after, the master node waits until it gets all the results from the shards,
which can take a while if the number of results is high,
and only after the master node has all the hits
does scroll ask for the data piece by piece.
Am I right?
Is size in the query used internally by terminate_after?
After using terminate_after, can I tell the coordinating node to assemble just part of the data (for example: assemble until you have 5000 results) without using size (because its value is limited to 10,000)?
After using terminate_after, to limit results accurately should I count the results returned by yield and stop after reaching the desired number of results?
That's partially correct. Unless you have changed the cluster-wide setting, the max documents a query/shard can return is capped at 10K. Setting size > 10K will fail the query. Setting terminate_after > size is meaningless.
For a query matching that many documents, the coordinating node will require a significant amount of heap to consolidate 30K documents. Memory also depends on what's fetched: _source for every document vs. a few fields per document. That's why I suggested evaluating whether 10K is the right threshold.
This should also answer the OOM issue you posted in another topic.
Do you mean a client node? I don't know how your cluster is set up. I usually do not route search/ingest via the master node.
The coordinating node will wait until the last shard responds or times out. For sorted documents, the last shard may have the top documents.
Size is the maximum number of documents to return. The coordinating node will request that same number of documents from each shard. terminate_after stops query execution as soon as n matching documents are found in a shard. They should be identical: if you set size=1000 and terminate_after=2000, you will search an extra 1000 documents per shard just to throw them away; if you set it the other way around, you run the risk that all your matches are in a single shard. Set both to the same value.
There is no need to count. The coordinating node may collect 3 × 1000 documents from 3 shards, but it will return only as many as you requested in size (1000).
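The "set both to the same value" advice can be sketched as a small helper; the query is a placeholder:

```python
# Sketch: keep `size` and `terminate_after` aligned so each shard
# stops as soon as it has found enough documents for one page.

def aligned_body(n, query=None):
    """Build a search body where early termination matches the page size."""
    return {
        "size": n,
        "terminate_after": n,  # per-shard cutoff equals requested page size
        "query": query or {"match_all": {}},
    }
```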
If I need to fetch 20 million documents (each doc has 40 fields),
will the coordinating node require a significant amount of heap to consolidate 20 million documents?
The batch size is 1000 in my code; is that OK? Do I need to benchmark and maximize it?
But what if I need to fetch only 20 million records out of 1 billion records?
I can't use size in the query because it is limited to 10,000.
If I set terminate_after to 20 million, then the batch size can't be 20 million,
so they can't be identical.
And if there are 20 million matching records in each shard, it will be very wasteful: the coordinator will assemble 60 million records and I don't need 40 million of them.
What do I need to do in that case so the fetch is optimized?
Why? Suppose I want to limit my results to 20 million.
If I set terminate_after to 20 million,
and each one of the 3 shards has that many results, the coordinator will assemble 60 million results.
Then the client gets 60 million results from the coordinator.
If I do nothing to limit that number, I get 60 million results instead of 20 million.
What is the solution for that case?
Do you think I need to change scroll to search_after to get better performance and fetch an accurate number of results?
Please help me.
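For context, a minimal sketch of what search_after pagination looks like; the sort fields (`timestamp` plus a unique tiebreaker) and values are illustrative assumptions, since search_after requires a deterministic sort order:

```python
# Sketch: building successive search_after pages.
# search_after replaces from/size paging: each request passes the
# sort values of the previous page's last hit.

def next_page_body(page_size, last_sort_values=None):
    """Build a search body for one search_after page."""
    body = {
        "size": page_size,
        # A deterministic sort is required; "timestamp" is hypothetical,
        # and a unique tiebreaker field should come last.
        "sort": [{"timestamp": "asc"}, {"_id": "asc"}],
        "query": {"match_all": {}},
    }
    if last_sort_values is not None:
        body["search_after"] = last_sort_values
    return body

# Usage (requires a cluster): issue the first page without search_after,
# then feed hits[-1]["sort"] of each response into the next call.
```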
But what if I need exactly 10,000 records?
Is it OK to fetch them in one batch?
There will be a lot of queries that need to return more than 10,000 records.
The user will specify the limit on the number of search results that they want.
Suppose we have 100 million records as an answer to a query and the user specifies that they want only 37,500,000 records out of the 100 million.
What do you think I need,
scroll or search_after?
I care about searching as fast as possible.
For that you need scroll. If, however, you have a large number of queries returning very large result sets, this will result in a lot of random disk I/O and is likely to be slow. Elasticsearch was not designed for returning very large volumes of data, as it is a search engine, so maybe it is not the right tool.
Quickly returning large result sets matching arbitrary queries will always be difficult and likely very expensive. I suspect you need to classify the queries you expect to support and look at indexing and query patterns.
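One way to enforce a client-side limit on a scroll stream, as discussed earlier in the thread ("count the results returned by yield and stop"), is to truncate the generator; this sketch assumes `helpers.scan` from elasticsearch-py, and the index name is illustrative:

```python
# Sketch: cap an unbounded hit stream (e.g. from helpers.scan)
# at exactly n documents on the client side.
from itertools import islice

def take(hit_stream, n):
    """Yield at most n items from an iterator, then stop consuming it."""
    return islice(hit_stream, n)

# Usage (requires a running cluster):
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("http://localhost:9200")
#   stream = helpers.scan(es, index="my-index",
#                         query={"query": {"match_all": {}}}, size=1000)
#   for hit in take(stream, 20_000_000):
#       ...
```

Stopping iteration early also stops the client from requesting further scroll pages, so only roughly one extra batch beyond the limit is fetched.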
Questions to ask are:
What is the minimal number of fields you need to be able to search on?
What different classes of queries do you have?
How many results do these need to return?
What are the latency requirements?
What is the expected query mix and concurrency?
Are you indexing new documents only, or do you also need to be able to delete and update?
What is the expected indexing rate and mix?