Iterating through large datasets in various orders efficiently


I'm extremely new to the world of elasticsearch, but even after reading a bit of the material, I'm still quite unsure how elasticsearch actually works under the hood.

I understand that it all essentially revolves around inverted indices, and it must probably also involve some internal statistics.

If I have a query with two filters, the engine has two inverted indices to choose from; it will likely pick the one with the lower number of matching docs, and simply "filter out" the other condition in software.
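(For illustration, here's a toy sketch of how two filters can be combined by merge-intersecting their sorted doc-ID lists — this is a simplification, not Lucene's actual implementation, and the data is made up:)

```python
# Toy illustration: each filter corresponds to a sorted postings
# list of matching doc IDs; a conjunction of two filters can be
# computed by merge-intersecting the two lists in linear time.

def intersect(postings_a, postings_b):
    """Merge-intersect two ascending doc-ID lists."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# hypothetical doc IDs matching "color:red" and "type:apple"
red = [2, 5, 7, 11, 19]
apple = [3, 5, 9, 11, 20]
print(intersect(red, apple))  # -> [5, 11]
```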

I also understand that, even though "compound" inverted indices don't exist, we can kind of "make them up" using the "copy_to" feature to create a compound field, and then filter on that field.
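(Something like the mapping below, following the example from the copy_to docs — the field names are just for illustration, and the dict would be passed as the mappings body when creating the index, e.g. via elasticsearch-py:)

```python
# Sketch of a "copy_to" mapping: values of first_name and
# last_name are copied into the single full_name field, which can
# then be queried as one compound field. Field names are the ones
# used in the Elasticsearch docs' example, not from a real index.

mappings = {
    "properties": {
        "first_name": {"type": "text", "copy_to": "full_name"},
        "last_name": {"type": "text", "copy_to": "full_name"},
        "full_name": {"type": "text"},
    }
}
```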

Now what I really don't understand, however, is sorting.
From what I understand, data is mainly sorted in memory, after the entries have actually been fetched.
This is fine if we're querying a fairly small set of data,
e.g. "get all apples ordered by price ascending" is good!
"get all products ordered by price ascending" is worse.

To solve this, there's the "sorted index" feature, which (provided we don't care about the number of matches) will essentially read only as many entries as it returns (given a single shard). That is, it won't read all 10M potential entries just to return a page of 10.
This also synergizes well with the "search_after" feature: given the index is sorted, it's easier for Elasticsearch to look up the start of the data it needs to return.
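(Concretely, I'm thinking of something like the following — index/field names are invented, and the dicts are the bodies you'd send as index settings and search requests:)

```python
# Sketch of index sorting plus search_after pagination.
# The settings lay the segments out on disk sorted by price
# ascending; a query whose sort matches the index sort, with
# track_total_hits disabled, can then terminate early.

settings = {
    "index": {
        "sort.field": "price",
        "sort.order": "asc",
    }
}

# First page: sort matches the index sort.
first_page = {
    "size": 10,
    "track_total_hits": False,
    "sort": [{"price": "asc"}],
}

# Next page: pass the sort values of the last hit of the
# previous page as the search_after cursor (value is made up).
next_page = {
    "size": 10,
    "track_total_hits": False,
    "sort": [{"price": "asc"}],
    "search_after": [19.99],
}
```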

But what happens if I want to query sorting under a different key?
Should I create a second sorted index, and insert all my entries twice? Is this a common use-case?

In Elasticsearch, all fields are indexed individually, and at query time the indices of multiple fields will be used, not just a single one. In the example you mentioned, both indices will be queried and the sets of matches combined based on the logic of the query. The restriction that only one index can be used per query, which exists in some databases, does not apply to Elasticsearch.

There is no need for this.

The shards return information about the matches they have identified to the node coordinating the query, which I believe is enough to score and sort; as far as I know, only the documents that are actually to be returned are fetched from disk.

If I recall correctly (I have not used this feature much), index sorting is a very specialised feature that optimises the data layout on disk when you have one specific sort order you always use, and can cut latencies that way. The trade-off is that it adds overhead at indexing time and does not help with any other sort order. It is therefore typically something you only use in very specific scenarios.

This is in no way required for search after functionality.

Sorting in different ways is standard and efficient in Elasticsearch and does not require any special handling. Just use a standard index.

Thanks for the answers!

I don't understand how that can work: wouldn't it just end up reading the same docs from disk multiple times? Or are both indices read in parallel and merged in memory? Or maybe each node is responsible for picking the index it reads from?

I understand that it's not "needed", but according to the docs it should make things a lot more efficient (cf. copy_to | Elasticsearch Guide [8.14] | Elastic)

So the documents themselves might not be fetched, but at least metadata on all of the matching entries will, no?

I understand that nothing is "required", but some things will be more optimized than others.
If the data is sorted on disk, "search_after" could (and I hope would) seek directly to the starting point, read sequentially, and stop after N rows. If it isn't, the data needs to be sorted first, and only then can search_after apply.

It can certainly search and sort over arbitrary queries, but some will be more efficient than others. I'm trying to identify ways to optimize multiple sort patterns. So far the only thing that has come to mind is having multiple indices, but that doesn't feel right at all... It's certainly not required for the feature to work, but it doesn't strike me as "efficient" either.

Although this guide is very old and there have been a lot of improvements to Elasticsearch since then, I believe the principles outlined in it are still valid.

This section of that book is also useful, but I would probably recommend you read the full chapter that the sections I linked to come from.

Do you have a real life performance problem you are trying to solve or are you perhaps trying to optimise for something that in reality may not be an issue at all?

Thanks, I don't know how I missed the book, I think the big yellow warning in the header put me off the first time I saw it and I never went back to it...
I think I understand the basic operations a lot better.
At least with regards to how non-scoring filters are computed and filtered.

Well, I do, but they're on an SQL server :slight_smile:
My data is extremely well structured, and I know exactly what kinds of queries I want to allow/disallow, which means I also know the optimal ways to index/retrieve it.
However, it's a bit painful to write in SQL, and the ingestion itself is a bit slow. Optimal access patterns also require a ton of duplication (magnifying the number of entries a hundredfold...).

So I'm looking at alternatives. I know ES works well (we already use it for log ingestion), but I'm essentially trying to evaluate the delta between it and the "ideal patterns" I know, and how closely I can tune it to match them.
It doesn't mean that's what I'll end up doing, but knowing the internals lets me know what's 100% fine and what could break.

Given the book (and the source code) I think I can manage to satiate my curiosity!