Slow query for large size values


I have an Elasticsearch 6.4 index with 5 shards, 1 replica per shard, and 1.5 billion documents.

I use Elasticsearch for textual search: I get the IDs of the results and then use them to query MongoDB.

We need to get up to 350.000 documents for each query, and this is where my problem starts. With a query size between 20 and 50.000 documents, my search takes less than 10 seconds to respond, with roughly 10MB of uncompressed JSON. But when I increase the size, the response time grows exponentially and I can't get the results.

Any suggestions to solve my problem? I can accept queries that take 30 seconds, but I need up to 350.000 documents.

Extracting a lot of hits from Elasticsearch takes time.
A few things you can do:

  • don't fetch the _source, as you only need the IDs
  • use the scroll API: by default, _search rejects requests where from + size goes beyond 10,000 documents (the index.max_result_window setting)
  • or use search_after
  • sort by _doc if possible (depends on your use case)
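
A minimal sketch of how the first, second, and last suggestions could fit together, in Python. It only builds the request body and walks the pages; first_page and next_page are hypothetical callables that you would wire to your own HTTP client (the initial POST /<index>/_search?scroll=1m, then POST /_search/scroll with the returned scroll ID). The page size and the 350.000 cap are illustrative values, not recommendations.

```python
from typing import Callable, List


def build_scroll_body(query: dict, page_size: int = 5000) -> dict:
    """Body for the initial scroll search request."""
    return {
        "size": page_size,   # hits per page, not the total you want
        "_source": False,    # skip document bodies, keep only metadata like _id
        "sort": ["_doc"],    # index order, the cheapest order for scrolling
        "query": query,
    }


def collect_ids(
    first_page: Callable[[], dict],
    next_page: Callable[[str], dict],
    max_hits: int = 350_000,
) -> List[str]:
    """Walk scroll pages until they run out or max_hits IDs are collected.

    first_page() and next_page(scroll_id) are assumed to return the parsed
    JSON response of the corresponding Elasticsearch call.
    """
    ids: List[str] = []
    resp = first_page()
    while True:
        hits = resp["hits"]["hits"]
        if not hits:
            break
        ids.extend(hit["_id"] for hit in hits)
        if len(ids) >= max_hits:
            break
        resp = next_page(resp["_scroll_id"])
    return ids[:max_hits]
```

Each page stays small, so no single response has to carry all 350.000 hits, and the collected IDs can be sent to MongoDB in batches while later pages are still being fetched.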

Some questions: why do you need to load those 350.000 hits from MongoDB? What is the use case for that? I'm asking because we may be able to propose an alternative.

Thank you for your reply.

I will test the scroll API; maybe it can help me.

I sort my documents by creation date in descending order, so I can't sort by _doc.
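
When the newest-first order has to be kept, search_after is one way to page through the results. A minimal sketch in Python, under several assumptions: created_at is a hypothetical name for the date field, doc_id is a hypothetical keyword field holding a copy of the document ID (used as a tiebreaker so the sort order is unique), and search is a hypothetical callable that posts the body to /<index>/_search and returns the parsed response.

```python
from typing import Callable, List, Optional


def build_page_body(query: dict, page_size: int = 10_000,
                    after: Optional[list] = None) -> dict:
    """Body for one page, newest documents first."""
    body = {
        "size": page_size,
        "_source": False,  # only the IDs are needed
        # created_at and doc_id are assumed field names, not from the thread
        "sort": [{"created_at": "desc"}, {"doc_id": "asc"}],
        "query": query,
    }
    if after is not None:
        # sort values of the last hit of the previous page
        body["search_after"] = after
    return body


def collect_newest_ids(search: Callable[[dict], dict], query: dict,
                       max_hits: int = 350_000,
                       page_size: int = 10_000) -> List[str]:
    """Collect up to max_hits IDs, paging with search_after."""
    ids: List[str] = []
    after = None
    while len(ids) < max_hits:
        size = min(page_size, max_hits - len(ids))
        hits = search(build_page_body(query, size, after))["hits"]["hits"]
        if not hits:
            break
        ids.extend(hit["_id"] for hit in hits)
        after = hits[-1]["sort"]
    return ids
```

Unlike scroll, search_after does not hold a snapshot of the index between pages, so documents indexed while paging can shift the results.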

I need this number of documents because my data is stored in MongoDB and the aggregations are performed there. Empirical tests showed us that this magic number, 350.000, gives a good approximation for the user filters, since textual search in MongoDB is very bad. In the past we used Solr for textual search, but we recently migrated to Elasticsearch and we are struggling to obtain the same number of documents we received from Solr.

Why not just run the aggregation on the Elasticsearch side, on the whole dataset instead of a subset?

We use aggregations heavily, and MongoDB performed better than Elasticsearch for them. Besides that, all of our legacy and new software uses MongoDB, and adapting all of it to Elasticsearch would be too expensive for us.

OK. Then sadly you can't work miracles with that mixed strategy.
Anyway, I hope the small pieces of advice I gave you help reduce the overall time a bit.

I'd now try comparing two approaches: a search in ES, plus fetching 350.000 partial results, plus computing the aggs on the MongoDB side, versus a single run on the Elasticsearch side with size: 0.
I'm pretty sure which architecture wins, but well, I understand the "legacy" concern.
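
A rough sketch of what the Elasticsearch-only side of that comparison could look like, in Python. The terms aggregation and the field passed to it are placeholders, since the thread does not say which aggregations are actually used; timed is a small hypothetical helper for measuring either approach with the same clock.

```python
import time
from typing import Any, Callable, Tuple


def build_agg_only_body(query: dict, field: str, buckets: int = 10) -> dict:
    """Search body that returns aggregation results and no hits."""
    return {
        "size": 0,  # no hits are fetched, only the aggregation is computed
        "query": query,
        "aggs": {
            # "by_value" is an arbitrary label; terms is only an example
            "by_value": {"terms": {"field": field, "size": buckets}},
        },
    }


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn once and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start
```

Wrapping both pipelines in timed (the ES search plus MongoDB aggregation on one side, the single size: 0 request on the other) would give numbers measured under the same conditions.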
