How to do parallel filter search of 1M queries in minimal time?

Hi, I would like to run a filter search over 1M filter values in parallel in order to reduce the total search time. This means a boolean query with 1M filter values, and no ordering is necessary. My index is about 1GB, stored in one shard on one node at the moment. I'm currently using searchAfter with 10k filter values per request, so I run 100 searchAfter queries in sequence, each covering 10k values (of the 1M total). This currently takes a long time (more than a couple of minutes). I would like to parallelize the search so that each of the 100 queries runs concurrently (e.g. on 100 client threads) and the results are aggregated within my application. What's the best way to do this? Some thoughts:

  • should I still use searchAfter if I want parallelism, or should I use multisearch? (I do not need ordering/sorting/pagination.) Is multisearch also limited by max_result_window? The reason I tried searchAfter was to exceed max_result_window.
  • should I create more primary shards to introduce parallelism at my index size? The Elasticsearch documentation recommends shard sizes between 10GB and 50GB, however, and my index is only 1GB.
  • could I hypothetically create 100 replicas of the 1GB shard (one per node) and issue 100 search requests in parallel? Theoretically, if I have a pool of 100 workers together serving 100 simultaneous searches, each search covering 10k of the filter values, then each worker would only need to process one search operation.
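The fan-out described above can be sketched in Python with the official client and a thread pool. This is a minimal sketch, assuming the `elasticsearch` Python client; `my-index`, `my_field`, and the `run_search` wiring are placeholder names, not anything from the thread:

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 10_000

def chunk(values, size):
    """Split the full list of filter values into fixed-size batches."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def build_request(batch):
    """One filter-only search per batch: no scoring, no sorting, IDs only.
    "my_field" is a placeholder for the actual field name."""
    return {
        "query": {"bool": {"filter": [{"terms": {"my_field": batch}}]}},
        "_source": False,   # hits still carry _id, which is all we need
        "size": 10_000,     # must stay within index.max_result_window
    }

def search_all(values, run_search, workers=100):
    """Fan the batches out over a thread pool and collect the responses.
    run_search would be something like
    lambda body: es.search(index="my-index", body=body)."""
    batches = chunk(values, BATCH_SIZE)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_search, map(build_request, batches)))
```

Because each batch stays at or below `max_result_window`, plain searches suffice per batch; searchAfter is only needed when a single request must page past that limit.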

What does resource usage look like while you are querying? With a 1GB index/shard I would expect the full data set to be cached in memory, assuming the node has enough RAM available. How much RAM and how many CPU cores do you have available? What do heap and CPU usage look like while you are querying?

No. As you will have lots of queries running in parallel, increasing the number of primary shards is likely to reduce performance rather than improve it.

You can run many searches in parallel against the single shard, either by using multiple client threads or multisearch. If system resources are maxed out you can add nodes and increase the number of replicas so all nodes hold a full copy of the index/shard.
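For the multisearch route, the `_msearch` API takes an NDJSON payload: one header line plus one search-body line per sub-search, where each body line is a complete search request. A minimal sketch of building that payload, with `my-index` and `my_field` as placeholder names:

```python
import json

def msearch_body(batches, index="my-index", field="my_field"):
    """Build the NDJSON payload for _msearch: a header line plus a
    search-body line per batch. Each sub-search is a full search request,
    so per-request options such as "size" and "_source" go in the body line."""
    lines = []
    for batch in batches:
        lines.append(json.dumps({"index": index}))           # header line
        lines.append(json.dumps({                            # search body line
            "query": {"bool": {"filter": [{"terms": {field: batch}}]}},
            "_source": False,
            "size": len(batch) * 10,  # generous cap if each term matches only a handful of docs
        }))
    return "\n".join(lines) + "\n"
```

With the Python client this would be sent as something like `es.msearch(body=msearch_body(batches))`, and the responses come back in the same order as the sub-requests.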

This sounds like an unusual set of requirements. What is the use case?

@Christian_Dahlqvist I have a variable amount of RAM/CPU, and have expanded it so that I am not saturating these resources. My use case is quite unusual. I have a large list of filter values, and I want to filter by all of them as fast as possible, getting just the IDs of the matching documents: "VALUE1" OR "VALUE2" (where both values are exact terms that can be filtered, and no scoring is necessary).
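That OR-of-exact-terms pattern maps onto a single `terms` query inside a `bool` filter context, which skips scoring entirely. A sketch of the request body as a Python dict, with `my_field` as a placeholder field name:

```python
# Exact-term filter, no scoring, only document IDs needed.
query = {
    "query": {
        "bool": {
            "filter": [
                {"terms": {"my_field": ["VALUE1", "VALUE2"]}}
            ]
        }
    },
    "_source": False,  # suppress _source; hits still include _id
}
```

Putting the `terms` clause in `filter` rather than `must` means the clause contributes no score and is eligible for caching.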

I have some questions before using multisearch.

Is multisearch bounded by max_result_window for each sub-request, or for the entire multisearch? I want to return more than max_result_window in total across all sub-requests (which is why I originally switched to searchAfter).

Second, is it possible to customize the _source field on search results with multisearch? This appears in the documentation for search and searchAfter, but not for multisearch.

How many documents do you have or plan to have? How many documents can each individual term match? Can a document match multiple terms?

About 1M-10M documents per index; we only have one index right now.
For this query/filter, each term matches a handful of documents.
A document will not match multiple terms, unless a term is repeated in the query.

This does not sound like something Elasticsearch is optimised for so I am not sure how much it is possible to speed it up. I am afraid I do not have any good suggestions.
