The documentation says that vector functions are applied linearly to every document matching a query, and that a filter should be used to restrict the number of documents that are scanned. In my use case I would like to filter by date range, but my index is so large that a single day can match millions of docs. If a user queried for a week or a month, it could match hundreds of millions of documents, which is clearly not something I would want to scan linearly.
Ideally I would like to randomly sample N documents that match my date range and then pass that limited set to the linear time vector function. Something like this:
To be clear, I am not asking how to use the search "size" parameter, rather I want to limit the inner query that passes its results to the script_score.
If you are willing to drop your requirement about a random sampling, you can do the following things:
An Elasticsearch search request has a terminate_after parameter: the maximum number of documents to collect on each shard, after which query execution terminates early. This does not give you a random sample, though, because collection always starts with the documents with the lowest internal IDs and proceeds to documents with higher internal IDs; if your index doesn't change, you will always get the same documents.
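As a rough sketch of the terminate_after approach, the request body could be built like this in Python. The field names `timestamp` and `my_vector`, the helper name, and the cap of 1000 are assumptions for illustration, and the exact `cosineSimilarity` script syntax depends on your Elasticsearch version (older 7.x releases expect `doc['my_vector']` instead of the field name as a string).

```python
def capped_script_score_body(query_vector, gte, lte, per_shard_cap=1000):
    """Build a search body that stops collecting after per_shard_cap docs per shard.

    Hypothetical field names: 'timestamp' (date) and 'my_vector' (dense_vector).
    """
    return {
        # Each shard stops collecting once it has seen this many matching docs.
        "terminate_after": per_shard_cap,
        "query": {
            "script_score": {
                # Cheap date-range filter selects the candidates.
                "query": {
                    "bool": {
                        "filter": [
                            {"range": {"timestamp": {"gte": gte, "lte": lte}}}
                        ]
                    }
                },
                "script": {
                    # +1.0 keeps the score non-negative, as script_score requires.
                    "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }
```

The resulting dict can be passed as the body of a search request. Note that the cap applies per shard, so the total number of scored documents is at most the cap times the number of shards.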
Another way to do this is to put cosine_similarity in rescoring:
A very fast range filter is executed, and the expensive cosine similarity calculation is applied only to the first 1000 docs. There is no random sampling here either; you will get the same 1000 docs.
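The original snippet is not reproduced here, but a request of this shape might look roughly like the following sketch. The field names `timestamp` and `my_vector` and the helper name are hypothetical, and the `cosineSimilarity` script syntax shown is the form used by more recent 7.x releases, so adjust it to your version.

```python
def rescore_body(query_vector, gte, lte, window_size=1000):
    """Build a search body: cheap range filter first, cosine similarity as a rescore.

    Hypothetical field names: 'timestamp' (date) and 'my_vector' (dense_vector).
    """
    return {
        # First pass: only the fast range filter runs over all matching docs.
        "query": {
            "bool": {
                "filter": [{"range": {"timestamp": {"gte": gte, "lte": lte}}}]
            }
        },
        # Second pass: the expensive script runs only on the rescore window.
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            # +1.0 keeps the score non-negative.
                            "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
                            "params": {"query_vector": query_vector},
                        },
                    }
                },
                # Ignore the first-pass score and rank by similarity alone.
                "query_weight": 0,
                "rescore_query_weight": 1,
            },
        },
    }
```

Setting `query_weight` to 0 is a design choice in this sketch: the filter contributes no meaningful score, so the final ranking within the window comes entirely from the cosine similarity.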
The only way I am aware of to get a random sample is indeed to apply the random_score function. To get a random sample you need to apply this function to all documents (or all documents selected by a filter). The good news is that the function is quite fast, so applying it to millions of documents should not be a problem. So what you can do is use a function_score query with the random_score function, and then rescore 1000 docs with the more expensive cosine_similarity function.
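A minimal sketch of that combination, under the same assumptions as before: `timestamp` and `my_vector` are hypothetical field names, the helper name and seed are illustrative, and the `cosineSimilarity` syntax may need adjusting for your Elasticsearch version.

```python
def random_sample_rescore_body(query_vector, gte, lte, sample_size=1000, seed=42):
    """Build a search body: random_score orders the filtered docs, then the
    top sample_size of them are rescored by cosine similarity.

    Hypothetical field names: 'timestamp' (date) and 'my_vector' (dense_vector).
    """
    return {
        "query": {
            "function_score": {
                # Date-range filter selects the candidate set.
                "query": {
                    "bool": {
                        "filter": [
                            {"range": {"timestamp": {"gte": gte, "lte": lte}}}
                        ]
                    }
                },
                # Cheap pseudo-random score per doc; a fixed seed makes it repeatable.
                "random_score": {"seed": seed, "field": "_seq_no"},
                # Use the random value as the score instead of combining it.
                "boost_mode": "replace",
            }
        },
        # The docs with the highest random scores form the sample to rescore.
        "rescore": {
            "window_size": sample_size,
            "query": {
                "rescore_query": {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
                            "params": {"query_vector": query_vector},
                        },
                    }
                },
                # Discard the random score so the final order is by similarity only.
                "query_weight": 0,
                "rescore_query_weight": 1,
            },
        },
    }
```

Changing the seed between requests yields a different sample; keeping it fixed returns the same sample as long as the index does not change.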