Dec 1st, 2021: [en] The impact of Elasticsearch source filtering on performance

:information_source: This post is also available in French: Dec 1st, 2021: [fr] L'impact du filtrage de champs sur les performances d'Elasticsearch

A few years ago, I needed to retrieve many documents from Elasticsearch, and it was going to take one hour. When I mentioned this to a colleague, he immediately asked me: "Have you filtered the source to only retrieve the fields you need?" I had not, and simply setting the _source parameter dramatically improved the performance of my queries.

Years later, is source filtering still important? If it is, what kind of speedup can we expect? I don't know, and I'm writing this post to find out.

Paginated searches

Say you want to retrieve 25k documents from Elasticsearch. Retrieving them all in one request could overload Elasticsearch, time out, or even be rejected by a proxy. Instead, the best practice is to perform multiple search queries, retrieving 1,000 documents at a time using the size and search_after parameters.

But between queries, the underlying data could change. This is where the point in time API comes in: it lets you pretend that nothing changed between queries, as if time were frozen.
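
To make this more concrete, here is a minimal sketch of the pattern using the 7.x Python client. The host, index name and page size are illustrative, and this is not the benchmark code used below:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Freeze a consistent view of the index for the duration of the pagination.
pit_id = es.open_point_in_time(index="pmc", keep_alive="1m")["id"]

hits, search_after = [], None
for _ in range(25):  # 25 pages x 1,000 documents = 25k documents
    body = {
        "size": 1000,
        "sort": [{"timestamp": "desc"}],
        "query": {"match_all": {}},
        # A search that uses a PIT must not target an index directly.
        "pit": {"id": pit_id, "keep_alive": "1m"},
    }
    if search_after is not None:
        body["search_after"] = search_after  # resume after the last hit seen
    page = es.search(body=body)["hits"]["hits"]
    if not page:
        break
    hits.extend(page)
    search_after = page[-1]["sort"]

es.close_point_in_time(body={"id": pit_id})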

Rally

To benchmark Elasticsearch, we'll use Rally, which is carefully designed to make your benchmarks reliable while also being easy to use.

Rally is all about racing (benchmarking) cars (Elasticsearch clusters) around tracks (steps such as ingesting data or querying it). In this post, we'll use the official PMC track. Its documents look like this:

{
 "name": "3_Biotech_2015_Dec_13_5(6)_1007-1019",
 "journal": "3 Biotech",
 "date": "2015 Dec 13",
 "volume": "5(6)",
 "issue": "1007-1019",
 "accession": "PMC4624133",
 "timestamp": "2015-10-30 20:08:11",
 "pmid": "",
 "body": "\n==== Front\n3 Biotech3 Biotech3 Biotech2190-572X2190-5738Springer ..."
}

What's interesting about this track is that most of the data is in the body field, so not having to read it should help performance. After cloning the elastic/rally-tracks repository, I installed Rally, applied this patch and modified the PMC challenge to test paginated searches.

To do so, I first took inspiration from the Rally docs to add the following file as pmc/operations/pagination.json:

{
  "name": "search-after-with-pit-default",
  "operation-type": "composite",
  "requests": [
    {
      "stream": [
        {
          "operation-type": "open-point-in-time",
          "name": "open-pit",
          "index": "pmc"
        },
        {
          "operation-type": "paginated-search",
          "name": "paginate",
          "index": "pmc",
          "with-point-in-time-from": "open-pit",
          "pages": 25,
          "results-per-page": 1000,
          "body": {
            "sort": [
              {"timestamp": "desc"}
            ],
            "query": {
              "match_all": {}
            }
          }        
        },
        {
          "name": "close-pit",
          "operation-type": "close-point-in-time",
          "with-point-in-time-from": "open-pit"
        }
      ]
    }
  ]
}

Next, I needed to actually use this operation by referencing it in the append-no-conflicts challenge, which lives in the pmc/challenges/default.json file:

{
  "operation": "search-after-with-pit-default",
  "warmup-iterations": 10,
  "iterations": 20
}

I also removed all other operations after wait-until-merges-finish as we only care about paginated searches here. Here's the resulting commit.

Benchmark setup

To make sure the benchmarks are reproducible and only benchmark Elasticsearch itself, I took a few precautions.

  • The load driver (i3en.6xlarge) and the Elasticsearch node (m5d.4xlarge) are two different machines in the same data center, which avoids any client or networking bottlenecks.
  • Before running the benchmarks, I trimmed the SSDs and dropped the Linux file system cache.
  • I made sure that no updates were running in the background.

esrally race

If you're following along at home, you can run the race like this and wait for a few minutes:

$ esrally race --distribution-version=7.15.2 --track-path=~/src/rally-tracks/pmc

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [79992a48-57cd-41fb-9549-68f71354c5dd]
[INFO] Preparing for race ...
[INFO] Racing on track [pmc], challenge [append-no-conflicts] and car ['defaults'] with version [7.15.2].

Running put-settings                      [100% done]
Running delete-index                      [100% done]
Running create-index                      [100% done]
Running check-cluster-health              [100% done]
Running index-append                      [100% done]
Running refresh-after-index               [100% done]
Running force-merge                       [100% done]
Running refresh-after-force-merge         [100% done]
Running wait-until-merges-finish          [100% done]
Running search-after-with-pit-default     [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                    Metric |                          Task |       Value |   Unit |
|--------------------------:|------------------------------:|------------:|-------:|
...
|   50th percentile latency | search-after-with-pit-default |     3314.37 |     ms |
|   90th percentile latency | search-after-with-pit-default |     3366.62 |     ms |
|  100th percentile latency | search-after-with-pit-default |     3400.26 |     ms |
|                error rate | search-after-with-pit-default |           0 |      % |

Opening/closing the point-in-time and searching 25k documents usually took around 3.3 seconds.

Comparing _source settings

Now that we have this 3.3-second baseline for the case where we retrieve all the data (including the large body field), let's compare it to other options by changing the query in our pmc/operations/pagination.json file:

  1. Filtering everything out with _source: false,
  2. Only reading a single short field: journal,
  3. Only reading the long body field.

To read a single field (options 2 and 3), we test both the _source parameter and the fields option.
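
For illustration, here is roughly what those search bodies look like, written as Python dicts. The exact bodies used in the benchmark live in pmc/operations/pagination.json; pairing fields with "_source": false is my assumption:

# Sketch of the search body variants being compared.
base = {"sort": [{"timestamp": "desc"}], "query": {"match_all": {}}}

# 1. No source at all: only metadata (_id, sort values, ...) comes back.
no_source = {**base, "_source": False}

# 2. A single short field, via source filtering or via the fields option
#    (combined with "_source": False so the full source is not also returned).
journal_source = {**base, "_source": ["journal"]}
journal_fields = {**base, "_source": False, "fields": ["journal"]}

# 3. Only the large body field, with either method.
body_source = {**base, "_source": ["body"]}
body_fields = {**base, "_source": False, "fields": ["body"]}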

Here's a table showing the median latency for each option:

|                     Method | Median latency |
|---------------------------:|---------------:|
|        No source filtering |          3.3 s |
|          "_source": false  |          1.1 s |
| journal only with _source  |          1.3 s |
|  journal only with fields  |          1.3 s |
|    body only with _source  |          3.7 s |
|     body only with fields  |          4.8 s |

It makes sense that requesting little or no data is faster: in this case, about 3 times faster! This means we can indeed recommend source filtering to improve paginated search performance.

However, requesting only body is slower than retrieving the full document, both with _source and with fields. When a single field already makes up the bulk of the document, source filtering appears to be a bad idea, and in this case using fields was even worse. A likely explanation is that Elasticsearch still has to load and parse the entire stored _source before it can filter it, so the extra work is not offset by a meaningfully smaller response.

Recommendations

In light of this, here's what we can recommend in practice with paginated search:

  1. If you can afford to only retrieve a small part of each document, then enable source filtering. It can only help: less data is transmitted, less is compressed and decompressed, and less JSON is encoded and decoded.
  2. If you need most or all of the document, then you'll have to profile your code. In the real world, the client machine can be much less powerful than the load driver used here. In that case, compression and JSON parsing can become the limiting factors: consider disabling compression and using a faster JSON parser (such as orjson in Python), as in the sketch below.
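
As an illustration of that second point, here is a minimal sketch of what it could look like with the 7.x Python client and orjson. The serializer class and host are mine, not something from the benchmark:

import orjson
from elasticsearch import Elasticsearch
from elasticsearch.serializer import JSONSerializer

class OrjsonSerializer(JSONSerializer):
    # Drop-in JSON serializer backed by orjson.

    def dumps(self, data):
        if isinstance(data, str):  # already-serialized bodies pass through
            return data
        # orjson produces bytes; the transport expects a str body.
        return orjson.dumps(data).decode("utf-8")

    def loads(self, s):
        return orjson.loads(s)

es = Elasticsearch(
    "http://localhost:9200",
    serializer=OrjsonSerializer(),
    http_compress=False,  # keep HTTP compression off when client CPU is the bottleneck
)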

Future work

Despite my careful benchmarking setup, I still used AWS instances: running on bare-metal servers, as we do for the Elasticsearch benchmarks, could give even more precise and reproducible results.

Another important thing to note is that the documents we were reading were small enough to stay cached in memory at all times, so Elasticsearch did not have to read anything from disk after warmup. For larger datasets, reading from disk could become the bottleneck. At that point, source filtering could help Lucene read from disk more efficiently thanks to the doc values optimization, but we would not know without running more careful benchmarks!

And finally, as mentioned in the recommendations, in practice it's easy to introduce accidental bottlenecks: always profile your code with a sampling profiler to see where the time goes!
