Percolator performance

Hello,

I'm currently writing an application that heavily uses ES percolator. To give an idea about what I'm trying to do:

I'm percolating an average of 20k new documents (each about 5 KB) against 10k+ queries, through a job that runs every 2 minutes. The goal is to have this job finish in less than 2 minutes, but with the cluster setup below it's currently taking no less than 15 minutes.

I have a dedicated cluster for the percolator index, separate from the main data cluster: 6 c3.xlarge nodes, with 10 shards and 1 replica.

Any ideas/suggestions for improving the current performance? Should adding more nodes solve the problem?

Thanks!

What sort of load does that put your system under?

This just in: https://www.elastic.co/blog/when-and-how-to-percolate-2

UPDATE: I've increased the number of nodes to 10 now.

It was CPU load, but that was when I was running with 6 nodes. Now, after adding 4 extra nodes, there is no load on the CPU, but the time did not change!

Thanks for the link.

Well, the problem with my percolator queries is that they don't have a high-cardinality filter that could be used for routing and/or filtering, so I end up percolating my documents against all the queries I have.

Note: I've already added a basic filter, but as I said, it's not high cardinality.
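
For anyone following along, this is a minimal sketch of the kind of metadata filtering I mean (the index name, the `topic` field, and the host are just placeholders, not my actual setup):

```python
# Sketch (Python + requests) of metadata-based filtering with the ES 1.x
# percolator: register a query with an extra metadata field, then filter
# on that field at percolate time. All names below are hypothetical.
import requests

ES = "http://localhost:9200"

# Register a query in the special .percolator type, together with a
# metadata field that later percolate requests can filter on.
requests.put(
    f"{ES}/percolator-index/.percolator/query-1",
    json={
        "query": {"match": {"body": "elasticsearch"}},
        "topic": "search",  # metadata used only for filtering, not matching
    },
)

# Percolate a document against only those queries whose "topic" is "search".
resp = requests.post(
    f"{ES}/percolator-index/doc/_percolate",
    json={
        "doc": {"body": "a document about elasticsearch percolation"},
        "filter": {"term": {"topic": "search"}},
    },
)
print(resp.json()["total"], resp.json().get("matches", []))
```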

Are you using the multi-percolate API [1] to batch your requests?

[1] https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-percolate.html#_multi_percolate_api
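
For reference, batching through `_mpercolate` could look roughly like this (untested sketch; index/type names, host, and batch size are placeholders, and the body is newline-delimited JSON with one header line per `doc` line):

```python
# Rough sketch of batched percolation via the ES 1.x multi-percolate API.
import json
import requests

ES = "http://localhost:9200"

def mpercolate(docs, index="percolator-index", doc_type="doc", batch_size=100):
    """Percolate `docs` (a list of dicts) in batches via _mpercolate."""
    for start in range(0, len(docs), batch_size):
        lines = []
        for doc in docs[start:start + batch_size]:
            # Each document needs a header line followed by the doc itself.
            lines.append(json.dumps({"percolate": {"index": index, "type": doc_type}}))
            lines.append(json.dumps({"doc": doc}))
        body = "\n".join(lines) + "\n"
        resp = requests.post(f"{ES}/_mpercolate", data=body)
        # The response contains one entry per percolated document.
        for item in resp.json()["responses"]:
            yield item
```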

I actually tried it before, with different batch sizes (50, 75 and 100 documents), and the performance was much worse... a single mpercolate request could take up to 3 minutes! I have a timeout of 1 minute, which I had to raise to 5 minutes to avoid timeout exceptions, so I ended up moving away from it.

You mention you're currently percolating 20,000 documents in 15 min, i.e. about 22/sec.
Assuming you're executing these sequentially, that isn't terrible per-query performance (< 50 ms). However, I appreciate your need to improve throughput.

I'm surprised by your mpercolate experience - this should really just be a network optimisation and doesn't increase parallelism on a node.

Some questions:

  1. You increased the number of nodes. Did you increase the number of shards or the number of replicas? How are you utilising this additional capacity? The only way you will improve performance here is to increase the shard count (re-index needed). The execution on each respective shard should in turn be faster.

  2. Are you executing these queries in parallel? If with the new nodes the CPU usage is low, try splitting into N batches and running N query threads in parallel. Given you have a replica count of 1, N = 2 would be a starting point.

Thanks for your reply.

To answer your questions:

  1. I have 10 shards with 1 replica, which makes 20 shards in total, 2 on each node. Do I need to increase the number of shards beyond 10?

  2. No, sequentially... relying on ES, which (as far as I understand) runs each request across the shards in parallel.

The document will be sent to every shard for percolation. This is done in parallel and the results are collected before being returned. You then percolate the next document, and so forth. Execution on each shard is single threaded, so per node you'll only be using a single CPU. Given you have replicas, and thus spare capacity, try running two percolate threads at once - so you are always percolating two documents at any time.

This should approximately double your throughput and CPU load. Please confirm with testing as usual.
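
Something along these lines (untested sketch; host, index and type are placeholders, and N matches the two-thread suggestion above):

```python
# Sketch of running N percolate requests in parallel with a thread pool,
# assuming the documents are already fetched into memory.
from concurrent.futures import ThreadPoolExecutor
import requests

ES = "http://localhost:9200"
N = 2  # number of documents in flight at any time

def percolate_one(doc):
    resp = requests.post(
        f"{ES}/percolator-index/doc/_percolate",
        json={"doc": doc},
    )
    return resp.json().get("matches", [])

def percolate_all(docs):
    # pool.map keeps N requests in flight and returns results in order.
    with ThreadPoolExecutor(max_workers=N) as pool:
        return list(pool.map(percolate_one, docs))
```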

Adding a little more detail. Currently a percolate request seems to be executing in 50 ms (please correct me if this is wrong), with your throughput around 20 docs/sec.

In order to execute 20,000 in 2 minutes, you need to be percolating around 165 docs/sec, so you need a throughput improvement of roughly 8x.

You haven't given quantitative measures of CPU usage - your testing needs to be more structured and methodical for me to help in more detail. However, you have two strategies to try to achieve this.

  1. Decrease the latency of a single request, thus improving overall throughput.
  2. Increase the number of parallel percolation executions.

(1) can be achieved either by filtering (which apparently you can't do) or by greater sharding. More shards means fewer docs per shard - the relationship between doc count and execution time per shard is linear. More shards also means greater parallelism, and in turn greater CPU usage.

(2) will utilise your replica shards and also improve parallelism - again at the expense of CPU usage.

Measure your CPU usage at your current shard count and the latency for a single request. Increasing the shards to 20 should approximately halve your latency (if you have enough CPU!). You need to find the balance between shard count and parallel executions.
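
If you do go the re-sharding route, the outline would be roughly as below (names are placeholders; in 1.x there is no _reindex API, so the queries have to be re-registered from your source of truth or via scan/scroll + bulk):

```python
# Sketch of re-creating the percolator index with more primary shards.
# The primary shard count is fixed at index creation time, hence the new index.
import requests

ES = "http://localhost:9200"

requests.put(
    f"{ES}/percolator-index-v2",
    json={
        "settings": {
            "number_of_shards": 20,   # more primaries -> fewer queries per shard
            "number_of_replicas": 1,
        }
    },
)

# ... then re-register every query into percolator-index-v2/.percolator/<id>
# and point the percolating job (or an alias) at the new index.
```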

Thanks for your detailed reply.

Yes, the average percolate request takes 50 ms to execute. I actually started with number 1 and increased the number of percolator index shards from 20 to 40 (with replicas set to 1), so a total of 80 shards distributed over 10 nodes (8 shards on each node). The weird thing is that this did not improve the average percolation time (it stayed at 50 ms), and CPU only went up a little, from 7-8% to 13-14%. I thought the increase in CPU usage would be bigger, which is why I find it weird.

I believe this should be limited by CPU, so the fact that you see limited CPU utilisation probably means you need to do more work in parallel. As your indices should be quite small, why not try setting the number of shards to the number of CPU cores per node and increasing the replication factor so that all nodes hold a copy of all shards. This should allow you to run a much larger number of requests in parallel, spread across all nodes, as every node has all the data needed to complete a request. You could then also try to see if using _local preference helps increase throughput.
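
Roughly, something like this (sketch only; the replica count of 9 assumes a 10-node cluster, all names are placeholders, and do verify that the preference parameter is honoured by _percolate in your version):

```python
# Sketch: raise the replica count so every node holds a full copy of the
# percolator index, then optionally prefer local shard copies per request.
import requests

ES = "http://localhost:9200"

# number_of_replicas can be changed on a live index.
requests.put(
    f"{ES}/percolator-index/_settings",
    json={"index": {"number_of_replicas": 9}},
)

# Percolate with preference=_local so the receiving node answers from its
# own shard copies (check that your ES version supports this on _percolate).
resp = requests.post(
    f"{ES}/percolator-index/doc/_percolate",
    params={"preference": "_local"},
    json={"doc": {"body": "example document"}},
)
print(resp.json().get("matches", []))
```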

There will be a point where you get no appreciable return from further sharding - quite quickly if your number of queries is low in the first place. You can't just keep doubling shards and expect to halve the response time. 50 ms may just be as close to optimal as you're likely to get - much of this cost is possibly just indexing the doc and setting up the percolate process. As Christian points out, this will all now be CPU bound. Increase the number of replicas and parallelise your requests. Consider also load balancing your requests across the nodes - every doc has to go to every node, but this spreads the load of result aggregation.

Once each node holds all data, you might even be able to benchmark using a single node and determine how throughput depends on the number of parallel requests, and then scale out to meet your throughput requirements.
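
A rough benchmark sketch of what I mean (sample data, host and index names are all placeholders):

```python
# Toy benchmark: percolate the same sample batch with increasing numbers of
# worker threads against a single node and report docs/sec for each level.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

ES = "http://localhost:9200"
sample_docs = [{"body": f"sample document {i}"} for i in range(500)]

def percolate_one(doc):
    requests.post(f"{ES}/percolator-index/doc/_percolate", json={"doc": doc})

for workers in (1, 2, 4, 8, 16):
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(percolate_one, sample_docs))
    rate = len(sample_docs) / (time.time() - start)
    print(f"{workers:>2} parallel requests -> {rate:.0f} docs/sec")
```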

Thanks for the useful insights.

I've already increased the number of replicas to 10 so that I have all the data on each node, and yes, as expected, CPU usage went up to 30-33%, but surprisingly the average percolate time per document went up to 150 ms (with the same number of documents and queries)!

Hello,

Last week I also tested percolate performance. In our case, for single percolate requests, the throughput was about 1K queries/min, and for mpercolate with about 100 queries per call, the throughput was about 2.5K/min. Our cluster doesn't use any replicas.
The queries were very simple, basically just term queries.
It would be better if there were an official benchmark, or some official numbers, to see whether these figures can be improved.
Sorry for going off topic, but maybe it helps you.

Thanks for your input.

+1