Percolation dead slow on EC2

Recently we switched our cluster to EC2 and everything is working great... except percolation :frowning:

To reindex (and percolate) our data we use a separate EC2 c3.8xlarge instance (32 cores, 60 GB RAM, 2 x 160 GB SSD) and restrict the index's shard allocation to this node only (sketched below).
Because we'll distribute the index amongst the rest of the nodes later, we use 10 shards and no replicas (just for indexing).
There are about 22 million documents in the index and 15,000 percolators. The index is a tad smaller than 11 GB (and so easily fits into memory).
About 16 PHP processes talk to the REST API with multi percolate requests of 200 percolate requests each (we reduced this from 1000 for performance reasons).
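
For reference, the allocation pinning above is plain shard allocation filtering on the index; roughly like this (the index and node names are placeholders, not our real ones):

```
PUT /items/_settings
{
  "index.routing.allocation.include._name": "indexing-node-1",
  "index.number_of_replicas": 0
}
```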

One percolation request (a real one, captured from the running PHP processes) takes around 2m20s. That would have been OK if one of the resources were fully utilized, but that's the strange thing (see the stats output here): load, CPU, memory, heap, IO; everything is well (very well) within limits. There doesn't seem to be a shortage of resources, yet percolation performance is still bad.

When we back off the PHP processes and try the same percolate request again, it completes in around 15s. Just to be clear: I don't have a problem with a 2-minute multi percolate request, as long as I know that one of the resources is fully utilized (so I can act upon it).

To rule out network, coordination and similar issues, we also ran the same request on the node itself (with the client enabled) under the same pressure from the PHP processes, with the same result.

Finally, we also upped the processors setting and restarted the node to fake our way to higher resource usage, to no avail. We tried tweaking the percolate thread pool size and queue, but that didn't make a bit of difference either. I truly hope someone here has an answer to this :slight_smile:
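
For context, these are the kinds of knobs we touched in elasticsearch.yml on the percolation node (ES 2.x setting names; the values are just examples of what we experimented with, not a recommendation):

```
# pretend there are more cores than there really are, so thread pools get sized bigger
processors: 48
# larger dedicated percolate thread pool and queue
threadpool.percolate.size: 64
threadpool.percolate.queue_size: 1000
```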

I've got a few updates on things we tried:

- Something I haven't mentioned before: we use Elasticsearch 2.2.0.
- To determine whether we could actually use all resources on EC2, we ran stress tests and everything seemed fine, pushing the load to over 40. IO, memory, etc. also showed no issues.
- When we looked at the hot threads, we discovered that UsageTrackingQueryCachingPolicy was coming up a lot, so we did as suggested in this issue. No change.
- Maybe it's the number of replicas? We upped it to 3 and used more EC2 instances to spread them out. No change.

Under load, we tried a batch of just one percolate request in a multi percolate call, directly on the data & client node (dedicated to this index), and it took 1m50s. When we tried a batch of 200 (still in one multi percolate call), it took 2m02s. It seems that the request is stuck somewhere for a very long time and then goes through the actual percolate phase quite smoothly.

Anyone?

Not sure what is going on here, but normally when executing many percolate requests, CPU usage increases significantly.

Can you take a look at the took fields in the multi percolate response and check whether those times add up to your measured timing? Each response item should have a took entry.
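
An abbreviated multi percolate response looks roughly like this (index and id values are placeholders); it's the per-item took values, in milliseconds, that I'm interested in:

```
{
  "responses": [
    {
      "took": 143,
      "_shards": { "total": 10, "successful": 10, "failed": 0 },
      "total": 2,
      "matches": [
        { "_index": "items", "_id": "12" },
        { "_index": "items", "_id": "87" }
      ]
    }
  ]
}
```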

Do you have a dedicated percolator index, or are the percolator queries in the same indices as the rest of your documents?

If you're not using the filter or query option, then for each percolate request the percolator will check, query by query, whether the document being percolated matches. Execution time is linear in the number of queries being evaluated. Adding more primary shards and nodes allows the percolation request to be parallelized (each shard evaluates its own queries).
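
As a sketch, the filter option on a single percolate request looks roughly like this (index, type and field names are made up); the filter matches against metadata fields stored with the registered percolator queries and limits how many of them each shard has to evaluate:

```
GET /items/message/_percolate
{
  "doc":    { "title": "Cheap flights to Lisbon" },
  "filter": { "term": { "category": "travel" } }
}
```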

Exactly our thinking, which is why we're a bit surprised not to see that pattern :slight_smile:

We tweaked the search, percolate and get thread pools some more, upping their sizes.

Under load, the request time and the took value differ by about 1 second. All took values are about the same (+/- 1ms), which makes sense, seeing that the structure of the queries in the percolator is practically the same. I think this difference shrank because we added more clients (each indexing data node is now also a client) and upped the client thread pool size.

The percolators are in the same index as the percolated documents. Do you think it would improve things if we separated them?

We are using the filter option to select the specific percolators we're interested in, which comes to around 2200 of them. Inside the percolator queries we do use filters; an example can be found in this gist. The queries might not be ideal (since they're assembled by code), but to me (unless I'm overlooking something) that wouldn't explain the under-utilized resources.

We're at 10 primary shards now, which would be ideal for later, when we distribute them to the "query only" nodes. I would be willing to try more primary shards to speed this up, but only if we saw that some resource was fully utilized, which is sadly not the case :confounded:

We had a short look at routing, but that doesn't seem viable at the moment (since we don't have a key that "guarantees" more or less evenly distributed shards, unless we just use a modulo of the primary key). Do you think it's worthwhile investigating this option further?

One last question: has anything about percolation changed from 2.1.0 to 2.2.0? I'm asking because the last time we percolated successfully was on 2.1.0. When we moved to AWS we also upgraded ES.

What do you mean by this, and how did you interact with your cluster before?

Yes, in general this can improve percolator performance. Percolate requests and search requests use system resources differently, and splitting the queries from the regular data lets you isolate the percolate requests. It also gives you the flexibility to use a different number of primary shards for the percolator index than for the regular index.
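
A rough sketch of such a dedicated percolator index (names and shard count are just examples). Note that it still needs mappings for the fields the queries refer to, because percolation builds an in-memory index for each document using those mappings:

```
PUT /percolator-queries
{
  "settings": { "number_of_shards": 16, "number_of_replicas": 0 },
  "mappings": {
    "message": {
      "properties": { "title": { "type": "string" } }
    }
  }
}

PUT /percolator-queries/.percolator/1
{
  "query": { "match": { "title": "lisbon" } }
}
```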

However, in this case it would be good to figure out what is causing this high latency. Are you only using the multi percolate API? Also, how many percolate requests are you bundling in a single multi percolate request, and how many multi percolate requests are being executed concurrently?

I don't think so, because you're already using the filter option (which is more flexible).

No, there are no changes.

Thanks for your elaborate reply Martijn.

Before, we had an HTTP client (http.enabled: true) running on each data node. We wanted to separate that in the new AWS infrastructure, to keep a closer eye on how many resources each type of task needed. For percolation we have now reverted that decision, giving each percolator node a client again (http.enabled: true).

Alright, we'll definitely try this then :slight_smile:

Yes, we're only hitting _mpercolate. We tried 1000 requests per call as well as 200. Of course the batch of 1000 was slower per call, but it gave less overhead, so throughput-wise it gives better results. We're now hitting the API with about 16 parallel PHP processes, but we're also playing with that number.
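
For completeness, the body of such a call is newline-delimited JSON, with a header line plus a body line per percolate request; roughly what one of our batches looks like (index, type and fields are placeholders):

```
GET /_mpercolate
{"percolate": {"index": "items", "type": "message"}}
{"doc": {"title": "Cheap flights to Lisbon"}, "filter": {"term": {"category": "travel"}}}
{"percolate": {"index": "items", "type": "message"}}
{"doc": {"title": "Hotel deals in Porto"}, "filter": {"term": {"category": "travel"}}}
```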

It seems that, in the end, there was no silver-bullet solution to our problem. We fiddled around with so many things and still couldn't get better resource utilization. We finally did something crazy and put 8 nodes on one big server (32 cores, 60 GB RAM, 2 x 160 GB SSD). It worked!

Elasticsearch is pretty much hardwired to keep you from shooting yourself in the foot, which is normally fantastic, but not in our case. We just want to push a server to the brink; no buffer capacity is needed for search because we know it doesn't need it. With those 8 nodes on 1 server we could get to 95% CPU and a load of around 30, which is exactly where we want to be. One note: diminishing returns strike hard :slight_smile: With one server there's obviously little overhead; with two, performance goes up by about 50%; with three, it goes up by another 30% or so. You can see where that's going. It's a shame that AWS doesn't provide bigger machines yet, but there are still options at the network level to look at, like enhanced networking and placement groups.
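
For anyone attempting the same: running several nodes on one box mostly came down to giving each instance its own name and heap and letting the ports auto-increment; a rough elasticsearch.yml sketch (ES 2.x settings, values are examples rather than our exact setup):

```
cluster.name: percolate-cluster
# allow several nodes on this machine to share the same path.data
node.max_local_storage_nodes: 8
# each instance is then started with its own name (e.g. bin/elasticsearch -Des.node.name=perco-1)
# and its own heap via ES_HEAP_SIZE; http (9200+) and transport (9300+) ports are
# assigned automatically per node
```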

We'll still have a look at separating the percolators from the data, and maybe routing is interesting as well. Thanks for all the help!