One ES container with n search threads is slower than two containers with n/2 search threads

I'm running a benchmark to optimize for search throughput (i.e. queries/second). The query I'm profiling is a custom query implemented in an Elasticsearch plugin; it's very similar to the Terms Set Query.

In one setup I ran a single ES container with an 8GB heap and 8 search threads on my 12-core Ubuntu VM. In another setup I ran two ES containers on the same VM, each with a 4GB heap and 4 search threads.

I was surprised to find that the two-container setup was about 30% faster. I also tried the single-container setup with a 4GB heap, and the two-container setup was still faster. My assumption going in was that they would be about the same, and maybe the single-container setup would be a bit faster since there's no network communication between ES containers.

Any ideas why this could be?

A few possibilities:

  • searches also use transport threads, both for the HTTP work and for the inter-node communication. Did you have more transport threads available, in aggregate, in the 2-node case? (One way to check is sketched after this list.)
  • GC will be happening too; maybe there are more resources available to do GC, in aggregate, with 2 JVMs running.
  • searches also do IO; maybe the two-node setup is presenting a different IO profile to your disks that they can handle better for some reason.
  • the experiment is confounded by something else; for instance, perhaps the limiting factor is the client, which achieves better throughput to multiple addresses for some reason. Check out https://www.elastic.co/blog/seven-tips-for-better-elasticsearch-benchmarks for more along these lines.
  • does this only happen when using your custom plugin? If so, that suggests the plugin is at fault.
  • it's possible there's a lock somewhere that sees more heavy contention in a one-node setup. If so, we might consider that a bug.
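
Regarding the first point, the per-node thread pool configuration is easy to check from the nodes info API. A minimal sketch, assuming the cluster is reachable on localhost:9200 and the requests package is installed (transport workers aren't a thread pool, so they're noted separately in the comments):

import requests

# Print each node's configured search thread pool size (assumes localhost:9200).
info = requests.get("http://localhost:9200/_nodes/thread_pool").json()
for node in info["nodes"].values():
    search_pool = node["thread_pool"]["search"]
    print(node["name"], "search threads:", search_pool["size"])

# Transport workers aren't a thread pool; one rough way to compare them across
# setups is to count threads named "transport_worker" in the output of
# GET /_nodes/hot_threads?ignore_idle_threads=false&threads=9999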

Have you run any low-level profiling, e.g. JFR or the newer async profiler? Might be worth doing so. If it yields anything interesting or mysterious then please share your findings here.

Thanks @DavidTurner

I'm thinking the IO point is the most likely culprit. When I profiled in VisualVM, the threads spend a lot of time in the "parked" state. When I profiled in JProfiler, they spend a lot of time in the "waiting" state. So it seems like there's some limit imposed by a single JVM on parallel IO. Can you recommend any tools for profiling this? VisualVM and JProfiler just tell me that the threads are parked/waiting.

Hmm, it depends on exactly what it means by PARKED or WAITING, and the exact stack traces you're seeing. In the context of a JVM thread those terms tend to mean that the thread is waiting on a lock; in Elasticsearch that's usually threads in a threadpool with no work to do. Threads waiting on IO are typically reported as RUNNABLE, since the blocking happens in the kernel and not in userspace.

I don't think that's the case, by the way: IO done by the JVM feeds through to the underlying syscalls pretty directly, so any such limit would have to be one that the kernel imposes on a single process performing IO.

JFR and async profiler (both linked in my earlier post) should both be able to tell you where the threads are parked/waiting.
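
In the meantime, the hot threads API can show the stack traces behind waiting threads directly. A rough sketch, assuming the cluster is on localhost:9200 and the requests package is available:

import requests

# Dump stacks of waiting threads across all nodes (plain-text output, grouped per node).
resp = requests.get(
    "http://localhost:9200/_nodes/hot_threads",
    params={"type": "wait", "threads": 50, "ignore_idle_threads": "false"},
)
print(resp.text)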

Ah that's good to know. I guess I was assuming that parked/waiting meant blocked on IO.

There are a couple of other possible bottlenecks that I'd like to investigate. One is that some shards might contain many more matches for a given query than others, so the threads searching the other shards finish quickly and then sit and wait. Is it safe to assume that if I have n shards and n search threads, and am making 1 search request at a time, ES should be smart enough to schedule one shard per thread?

If that doesn't yield anything useful, I'll try to create a reproducible setup for this issue.

The actual behaviour isn't quite that strict, but it achieves the same effect. The shard-level tasks are added to a queue which is consumed by the n search threads, so although there's no guarantee that each thread runs exactly one task we do know that there won't be tasks waiting in the queue while there are idle workers.
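
Conceptually it's just the standard work-queue pattern. A toy illustration (purely illustrative, not Elasticsearch's actual code):

import queue
import threading

# Toy model: one queue of shard-level tasks, n search threads pulling from it.
# No task sits in the queue while any worker is idle, but there's no guarantee
# of exactly one task per thread.
tasks = queue.Queue()
for shard in range(8):          # one task per shard for a single request
    tasks.put(shard)

def run_shard_search(shard):
    pass                        # stand-in for the real per-shard search

def search_worker():
    while True:
        try:
            shard = tasks.get_nowait()
        except queue.Empty:
            return              # nothing left to do; thread goes back to idle
        run_shard_search(shard)

workers = [threading.Thread(target=search_worker) for _ in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()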

The story is different for transport threads: the relationship between transport connections and threads is many-to-one, not many-to-many. I don't expect this to matter, since we do almost nothing on the transport thread before handing the work off elsewhere, but I mention it for completeness. If this experiment were bottlenecked on the transport threads then we'd probably consider that a bug.

Thanks for the clarification. I just checked a couple things using some printlns and greps. The number of docs each query finds in each shard is uniform, and each of the search threads is used exactly once per query.

Here's a related issue I posted over on the github repo where the plugin is implemented: https://github.com/alexklibisz/elastiknn/issues/220

At this point my best guess is that the search threads are waiting while other threads parse the JSON query, combine and re-rank the results from each shard, and serialize the JSON results. If the combining and re-ranking happen once per node, and thus can happen in parallel, that might explain why the two-node queries are faster. It would also explain why I continue to see a speedup when running more parallel search requests.

It would, but I don't think anything involved in a search request happens once per node. Parsing the input JSON, collecting the shard-by-shard results, and serialising the result all happen only on the coordinating node (i.e. the node that received the original HTTP request).

The image you shared in that link does look weird: there are periods in which all the search threads are simultaneously red. Pretty sure JFR will give you the information you need to dig deeper there.

I ran some flight recordings with JDK Mission Control, but I don't see any new or useful information there.

The analysis basically shows that there's a method which gets called once on every document. That's to be expected, and ideally it would be parallelized such that searching n shards is n times faster than searching 1 shard. But that's not the case: 8 shards are only about 3x faster than 1 shard.

The threads view is a bit confusing. The search threads don't show the "waiting" or "parked" states that VisualVM and JProfiler both showed. There's also no obvious sign of locking, sleeping, etc.

Here's a repro using very simple Elasticsearch term queries with 7.9.3 containers: https://github.com/alexklibisz/elastiknn/blob/shards-question/examples/parallel-shards-question/benchmark.ipynb

To summarize, the numbers look like this:

  • 1 master node, 1 data node, 1 shard: 188 queries/second
  • 1 master node, 1 data node, 8 shards: 220 queries/second
  • 1 master node, 2 data nodes, 8 shards (4 per node): 258 queries/second

The notebook should make this reproducible. To get the code: git clone git@github.com:alexklibisz/elastiknn.git, cd elastiknn, git checkout shards-question, cd examples/parallel-shards-question.
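
For reference, the core of the measurement is essentially a sequential loop like this (simplified sketch; assumes the official elasticsearch Python client, and the index and field names here are just placeholders):

import time
from elasticsearch import Elasticsearch

# Simplified sketch: run n term queries back to back and report throughput.
es = Elasticsearch(["http://localhost:9200"])
body = {"query": {"term": {"color": "blue"}}}   # placeholder query

n = 10000
start = time.perf_counter()
for _ in range(n):
    es.search(index="ix-8", body=body)
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.0f} queries/second")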

The docker-compose configuration is missing from your description of how to reproduce this.

In particular, in your OP you said that in the 2-node case you restricted each node to only have 4 search threads, but it is not clear how you're applying that restriction.

AFAICT these tests are completely bottlenecked on the client (and, to some extent, request-level coordination), which is one of the deadly sins of benchmarking I mentioned earlier. Elasticsearch has a lot of threads in PARKED state simply because the client isn't giving them any work to do. You're suffering from Amdahl's law: the workload is mostly sequential, so increasing the width of the parallel part doesn't help much.
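
For illustration: Amdahl's law caps the speedup of a workload with serial fraction s at 1/(s + (1 − s)/N) for N-way parallelism, so your earlier observation of 8 shards being only about 3x faster than 1 would correspond to s ≈ 0.24, i.e. roughly a quarter of each request spent in the non-parallel parts (client plus coordination). One way to take the client out of the equation is to keep several requests in flight at once; a minimal sketch, again assuming the elasticsearch Python client and placeholder names:

import time
from concurrent.futures import ThreadPoolExecutor
from elasticsearch import Elasticsearch

# Minimal sketch of a wider client: keep 16 requests in flight at once.
es = Elasticsearch(["http://localhost:9200"])
body = {"query": {"term": {"color": "blue"}}}   # placeholder query

def one_search(_):
    es.search(index="ix-8", body=body)

n = 10000
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(one_search, range(n)))
print(f"{n / (time.perf_counter() - start):.0f} queries/second")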

Running this on your development laptop is another of those deadly sins. I tried your code on a more realistic production machine and observed essentially no difference whether the index was restricted to a single node or was spread across two:

(discuss-256965) davidturner@worker-772973:~/discuss/256965$ python
Python 3.8.1 (default, Nov 30 2020, 09:30:32)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import test
>>> test.search('ix-8', n=10000)
Running 10000 queries on index ix-8: 100%|██████████████████████████████████████████████████████| 10000/10000 [00:39<00:00, 250.89it/s]
>>> test.search('ix-8-onenode', n=10000)
Running 10000 queries on index ix-8-onenode: 100%|██████████████████████████████████████████████| 10000/10000 [00:40<00:00, 247.73it/s]
>>>
(discuss-256965) davidturner@worker-772973:~/discuss/256965$ curl -s 'http://localhost:9200/_cat/shards'  | sort
ix-1         0 p STARTED 278677 96.7mb 127.0.0.1 node-2
ix-8         0 p STARTED  35010 12.6mb 127.0.0.1 node-1
ix-8         1 p STARTED  35182 12.5mb 127.0.0.1 node-1
ix-8         2 p STARTED  35043 12.3mb 127.0.0.1 node-1
ix-8         3 p STARTED  34624 12.4mb 127.0.0.1 node-2
ix-8         4 p STARTED  34611 12.3mb 127.0.0.1 node-1
ix-8         5 p STARTED  34738 12.4mb 127.0.0.1 node-2
ix-8         6 p STARTED  34722 12.4mb 127.0.0.1 node-2
ix-8         7 p STARTED  34747 12.3mb 127.0.0.1 node-2
ix-8-onenode 0 p STARTED  35010 12.6mb 127.0.0.1 node-1
ix-8-onenode 1 p STARTED  35182 12.5mb 127.0.0.1 node-1
ix-8-onenode 2 p STARTED  35043 12.3mb 127.0.0.1 node-1
ix-8-onenode 3 p STARTED  34624 12.4mb 127.0.0.1 node-1
ix-8-onenode 4 p STARTED  34611 12.3mb 127.0.0.1 node-1
ix-8-onenode 5 p STARTED  34738 12.4mb 127.0.0.1 node-1
ix-8-onenode 6 p STARTED  34722 12.4mb 127.0.0.1 node-1
ix-8-onenode 7 p STARTED  34747 12.3mb 127.0.0.1 node-1

It's best not to "roll your own" when it comes to benchmarking; it's just too easy to fall into traps like this. We recommend Rally instead.

Also FWIW there is no IO involved in this test: the indices are so small that they will be 100% in filesystem cache.


@DavidTurner Thanks for looking this over so thoroughly.

The docker-compose configuration is missing from your description of how to reproduce this. In particular, in your OP you said that in the 2-node case you restricted each node to only have 4 search threads, but it is not clear how you're applying that restriction.

Sorry about forgetting the docker-compose.yml. Here it is.

In the benchmark I just kept 8 search threads for the 2-node case. I figured there's no harm in that, since each node only holds 4 shards that its search threads could be operating on, hence at most 4 threads running per node at any given time.
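
For what it's worth, one way to confirm that is to poll the search thread pool's active count on each node while the benchmark is running; a rough sketch, assuming localhost:9200 and the requests package:

import time
import requests

# Print active/queued/size for the search pool on each node, once per second.
for _ in range(10):
    resp = requests.get(
        "http://localhost:9200/_cat/thread_pool/search",
        params={"v": "true", "h": "node_name,name,active,queue,size"},
    )
    print(resp.text)
    time.sleep(1)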

Elasticsearch has a lot of threads in PARKED state simply because the client isn't giving them any work to do. You're suffering from Amdahl's law: the workload is mostly sequential, so increasing the width of the parallel part doesn't help much.

Ok, so it seems like the "hunch" I described in my notebook is at least part of the explanation.

Running this on your development laptop is another of those deadly sins. I tried your code on a more realistic production machine and observed essentially no difference whether the index was restricted to a single node or was spread across two:

I just tried it out on an EC2 c5.4xlarge and got these results:

  • 1 master node, 1 data node, 1 shard: 258 queries/second
  • 1 master node, 1 data node, 8 shards: 274 queries/second
  • 1 master node, 2 data nodes, 8 shards (4 per node): 318 queries/second

Here is the updated notebook

I tried it again with 1M docs on the c5.4xlarge:

  • 1 master node, 1 data node, 1 shard: 89 queries/second
  • 1 master node, 1 data node, 8 shards: 132 queries/second
  • 1 master node, 2 data nodes, 8 shards (4 per node): 157 queries/second

Here is the updated notebook

Maybe there's something peculiar about my docker-compose setup?

It's best not to "roll your own" when it comes to benchmarking; it's just too easy to fall into traps like this. We recommend Rally instead.

FWIW, I installed esrally and spent about an hour trying to figure out how to make it use my local Elasticsearch cluster. It somehow kept spinning up its own ES, which I guess is a nice default in general, but got in the way here. I'd appreciate any pointers for how I can check this specific behavior with Rally.

@alexklibisz it looks like you already have Rally installed, excellent. Curious to see if you can reproduce with those benchmarks. You can tell Rally not to provision a cluster by using --pipeline=benchmark-only and --target-hosts=<your-host>:<your-port> in the command.

More info:


I'll try this out. Any suggested datasets/tracks to use here?

Alright, I ran it using this command:

esrally --track=pmc --kill-running-processes --pipeline=benchmark-only --challenge=append-no-conflicts --target-hosts=localhost:9200

I picked PMC because it sounded the most similar to my use case, which is basically just simple term matching. I also checked that the two-node run actually split the five shards across the two nodes, and it does.

Here are the results for what I assume are standard term queries of some sort:

One node:

|      50th percentile latency | term | 5.05743 | ms |
|      90th percentile latency | term | 5.46878 | ms |
|      99th percentile latency | term | 5.79232 | ms |
|     100th percentile latency | term | 8.71734 | ms |
| 50th percentile service time | term | 4.33287 | ms |
| 90th percentile service time | term | 4.53404 | ms |
| 99th percentile service time | term | 4.86713 | ms |
|100th percentile service time | term | 7.58252 | ms |

Two nodes:

|      50th percentile latency | term |  4.9212 | ms |
|      90th percentile latency | term | 5.41786 | ms |
|      99th percentile latency | term | 5.92577 | ms |
|     100th percentile latency | term | 7.14734 | ms |
| 50th percentile service time | term | 4.17579 | ms |
| 90th percentile service time | term |  4.5802 | ms |
| 99th percentile service time | term | 4.98141 | ms |
|100th percentile service time | term | 6.30278 | ms |

The only mildly noteworthy thing is that the 100th percentiles are lower by over 1ms for the two-node case. I guess once you're dealing with fractions of milliseconds it's tough to draw any conclusions from two such numbers. I'm also not sure how many queries Rally is running here.

Here are the full results: https://gist.github.com/alexklibisz/c45259a9b1e8d394b00b4e5a968d7e02

Let me know if you see any obvious flaws in the way I set this up. I'm not totally satisfied with the conclusion here: there's clearly something strange causing the 1-node vs. 2-node differences in the very simple benchmarks I posted, which also aligns with what I see in my own application. But I don't see much else to try.

If you still want to eke out a comparative case, I have some suggestions. Typically these benchmarks are run against smaller nodes for our purposes, so the overall throughput target is not very high, but they can be customized for your case if need be.

If you look at this line in rally-tracks you will see we target a throughput of 20, which is what was achieved in both cases according to your results. You may want to change this value to further exercise the two configurations and try to detect a performance difference.

The high-level steps to do that are:
a) clone the rally-tracks repository, linked above
b) change the value in that file
c) run esrally with the --track-path parameter (doc) pointing to <...>/rally-tracks/pmc
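
For example (the path is just wherever you cloned rally-tracks), the rerun might look something like:

esrally --track-path=<path-to>/rally-tracks/pmc --pipeline=benchmark-only --challenge=append-no-conflicts --target-hosts=localhost:9200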

Some other docs you might find useful for your use case:


@RickBoyd Thanks. I'll try to look into these options over the weekend.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.