Percolator performance ideas

I'm trying to optimize the performance of my percolators, and hoping folks out there have some ideas, as I can't find a lot out there. Percolator performance seems to be great until I hit ~100,000 queries, and then falls off. Unfortunately I need to be able to support 500,000-1,000,000 queries per index.

Queries are already pretty optimized. I'm using a sample query for now anyway.

I do use the query property to filter which queries get executed.

I tried turning off auto_expand_replica but it seemed to ignore me.

I manually create the _percolator index and set it to 2 shards and 0 replicas but it instantly sets replicas to 1. Can the _percolator index be shared like a regular index? And will the engine distribute the percolating document across shards like the search side of things?

Can Routing be used with percolate?

Any other ideas??

Thanks in advance!

TYPO: Can the _percolator index be sharded like a regular index?

On Friday, June 8, 2012 7:00:40 AM UTC-4, shadow000fire wrote:

I'm trying to optimize the performance of my percolators, and hoping folks
out there have some ideas, as I can't find a lot out there. Percolator
performance seems to be great until I hit ~100,000 queries, and then falls
off. Unfortunately I need to be able to support 500,000-1,000,000 queries
per index.

Queries are already pretty optimized. I'm using a sample query for now
anyway.

I do use the query property to filter which queries get executed.

I tried turning off auto_expand_replica but it seemed to ignore me.

I manually create the _percolator index and set it to 2 shards and 0
replicas but it instantly sets replicas to 1. Can the _percolator index be
shared like a regular index? And will the engine distribute the percolating
document across shards like the search side of things?

Can Routing be used with percolate?

Any other ideas??

Thanks in advance!

The design of the _percolator is such that it needs to exist on all nodes
storage wise. It loads the relevant queries when actual indices are
allocated. I have some ideas on how to improve that, but for now, you can
do the partitioning yourself by having some queries in one "index" (i.e.
/_percolator/index1) and another in another index (/_percolator/index2),
and execute the percolate request on both.

On Fri, Jun 8, 2012 at 1:00 PM, Jason Scheller jason.scheller@gmail.comwrote:

I'm trying to optimize the performance of my percolators, and hoping folks
out there have some ideas, as I can't find a lot out there. Percolator
performance seems to be great until I hit ~100,000 queries, and then falls
off. Unfortunately I need to be able to support 500,000-1,000,000 queries
per index.

Queries are already pretty optimized. I'm using a sample query for now
anyway.

I do use the query property to filter which queries get executed.

I tried turning off auto_expand_replica but it seemed to ignore me.

I manually create the _percolator index and set it to 2 shards and 0
replicas but it instantly sets replicas to 1. Can the _percolator index be
shared like a regular index? And will the engine distribute the percolating
document across shards like the search side of things?

Can Routing be used with percolate?

Any other ideas??

Thanks in advance!

Can you explain that please? What part of the percolator design requires
the queries to be replicated on every node? I was thinking it could work
similar to search: one node could build the index for the document, send it
to the other shards, each shard could execute its own local subset of the
index and return the matched IDs, the original node could aggregate the
responses and return.

Would you mind sharing your ideas (whatever part you have formulated so
far?)? I'd like to be able to plan accordingly, and might have some time to
help with implementation.

Just to clarify, you're saying that currently the _percolator index cannot
be sharded like other indices, correct? The percolate documentationhttp://www.elasticsearch.org/guide/reference/api/percolate.html(at least how I read it) makes it sound like it can be sharded by setting index.auto_expand_replica
to false. If not, what does that setting actually do?

Another feature that we might consider in the future is to allow multiple
_percolator indices. (ie. _percolator1, percolator2, custompercolator, etc)
so that the sharding (if possible) can be configured differently for each
(non-percolator) index.

On Monday, June 11, 2012 10:54:33 AM UTC-4, kimchy wrote:

The design of the _percolator is such that it needs to exist on all nodes
storage wise. It loads the relevant queries when actual indices are
allocated. I have some ideas on how to improve that, but for now, you can
do the partitioning yourself by having some queries in one "index" (i.e.
/_percolator/index1) and another in another index (/_percolator/index2),
and execute the percolate request on both.

On Fri, Jun 8, 2012 at 1:00 PM, shadow000fire wrote:

I'm trying to optimize the performance of my percolators, and hoping
folks out there have some ideas, as I can't find a lot out there.
Percolator performance seems to be great until I hit ~100,000 queries, and
then falls off. Unfortunately I need to be able to support
500,000-1,000,000 queries per index.

Queries are already pretty optimized. I'm using a sample query for now
anyway.

I do use the query property to filter which queries get executed.

I tried turning off auto_expand_replica but it seemed to ignore me.

I manually create the _percolator index and set it to 2 shards and 0
replicas but it instantly sets replicas to 1. Can the _percolator index be
shared like a regular index? And will the engine distribute the percolating
document across shards like the search side of things?

Can Routing be used with percolate?

Any other ideas??

Thanks in advance!

The _percolator index needs to be shared among the nodes, and each node
loads the relevant data (relevant index is allocated on that node). The
design loads all the queries for a relevant index once a shard of it is
allocated on that node. Then, when percolating against that index, we can
send the percolate request to one of the nodes where a shard of the
respective index is allocated, and percolate it there. Because percolation
process is very fast, usually, there is no need to distribute it. Note, the
_percolator index just acts as data storage, nothing more.

One way to have it distributed is to have percolator queries registered
spread across several indices, and do the percolation in parallel against
them.

The ideas for enhancements are several:

  1. Have the ability to execute the percolate request against multiple
    indices, and parallelize based on that.
  2. Allow to specify a parallelism level when percolating, and we can
    distribute the percolate request based on that, and it will execute only on
    subsets of the data on each shard.
  3. Have a /{index_name}/_percolate storage (or something similar) where
    percolated queries will be stored in a sharded mode, and each percolator
    request will be distributed across the shards (similar to search).

-shay.banon

On Mon, Jun 11, 2012 at 6:59 PM, shadow000fire jason.scheller@gmail.comwrote:

Can you explain that please? What part of the percolator design requires
the queries to be replicated on every node? I was thinking it could work
similar to search: one node could build the index for the document, send it
to the other shards, each shard could execute its own local subset of the
index and return the matched IDs, the original node could aggregate the
responses and return.

Would you mind sharing your ideas (whatever part you have formulated so
far?)? I'd like to be able to plan accordingly, and might have some time to
help with implementation.

Just to clarify, you're saying that currently the _percolator index cannot
be sharded like other indices, correct? The percolate documentationhttp://www.elasticsearch.org/guide/reference/api/percolate.html(at least how I read it) makes it sound like it can be sharded by setting index.auto_expand_replica
to false. If not, what does that setting actually do?

Another feature that we might consider in the future is to allow multiple
_percolator indices. (ie. _percolator1, percolator2, custompercolator, etc)
so that the sharding (if possible) can be configured differently for each
(non-percolator) index.

On Monday, June 11, 2012 10:54:33 AM UTC-4, kimchy wrote:

The design of the _percolator is such that it needs to exist on all nodes
storage wise. It loads the relevant queries when actual indices are
allocated. I have some ideas on how to improve that, but for now, you can
do the partitioning yourself by having some queries in one "index" (i.e.
/_percolator/index1) and another in another index (/_percolator/index2),
and execute the percolate request on both.

On Fri, Jun 8, 2012 at 1:00 PM, shadow000fire wrote:

I'm trying to optimize the performance of my percolators, and hoping
folks out there have some ideas, as I can't find a lot out there.
Percolator performance seems to be great until I hit ~100,000 queries, and
then falls off. Unfortunately I need to be able to support
500,000-1,000,000 queries per index.

Queries are already pretty optimized. I'm using a sample query for now
anyway.

I do use the query property to filter which queries get executed.

I tried turning off auto_expand_replica but it seemed to ignore me.

I manually create the _percolator index and set it to 2 shards and 0
replicas but it instantly sets replicas to 1. Can the _percolator index be
shared like a regular index? And will the engine distribute the percolating
document across shards like the search side of things?

Can Routing be used with percolate?

Any other ideas??

Thanks in advance!

By any chance did any of these improvements get implemented since last year when they were mentioned here?

The _percolator index needs to be shared among the nodes, and each node loads the relevant data (relevant index is allocated on that node). The design loads all the queries for a relevant index once a shard of it is allocated on that node. Then, when percolating against that index, we can send the percolate request to one of the nodes where a shard of the respective index is allocated, and percolate it there. Because percolation process is very fast, usually, there is no need to distribute it. Note, the _percolator index just acts as data storage, nothing more.

One way to have it distributed is to have percolator queries registered
spread across several indices, and do the percolation in parallel against
them.

The ideas for enhancements are several:

  1. Have the ability to execute the percolate request against multiple
    indices, and parallelize based on that.
  2. Allow to specify a parallelism level when percolating, and we can
    distribute the percolate request based on that, and it will execute only on
    subsets of the data on each shard.
  3. Have a /{index_name}/_percolate storage (or something similar) where
    percolated queries will be stored in a sharded mode, and each percolator
    request will be distributed across the shards (similar to search).

-shay.banon