Percolate -- Routing

On Wednesday, May 1, 2013 9:56:29 AM UTC-7, Adam Georgiou wrote:

When percolating a document against an index, how does the cluster
determine which node to run the _percolate on?

It seems to me that the routing value of the document is the most logical
mechanism to do this, which would imply that more shards == better
performance when it comes to percolation (in general).

However, the PDF distributed at ES core training says that
"the number of target index replicas will increase the performance".

Should that not read "...target index primaries..."? Or am I confused?

-Adam

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

On Wed, May 1, 2013 at 3:20 PM, Taras Shkvarchuk wrote:

Each replica can check a different percolate rule.

On Wednesday, May 1, 2013 12:58:53 PM UTC-7, Adam Georgiou wrote:

Are you sure about that, Taras?

I'm pretty sure that the shards are only used for determining which node
the _percolate runs on. Once a node is determined, an on-the-fly Lucene
index is created in memory -- completely independent of any shards -- for
the to-percolate document. All the queries in the _percolator index are then
run against that Lucene index, and the results are returned.

I was under the impression that the kind of parallel work that I assume
you're implying does not happen. In other words, there's no way to get
_percolate to check half your queries on one node and half on another.

Is the above not the case?
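
For what it's worth, the model described above can be sketched in a few lines of Python. This is a toy illustration, not Elasticsearch code: it assumes a registered query can be modeled as a predicate over a plain dict, and `term`, `percolate`, and the query names are made up for the example. What it shows is one document being checked against every registered query in a single place, with no splitting of the query set across nodes.

```python
from typing import Callable, Dict, List

# Assumption: a registered query is modeled as a predicate over a document.
Query = Callable[[Dict[str, str]], bool]


def term(field: str, value: str) -> Query:
    """Toy 'term query': matches when the field holds exactly this value."""
    return lambda doc: doc.get(field) == value


def percolate(doc: Dict[str, str], registered: Dict[str, Query]) -> List[str]:
    """Run every registered query against one document, all in one place.

    Stand-in for the single-document in-memory index: the whole query set
    is evaluated here, nothing is farmed out to other nodes.
    """
    return sorted(name for name, query in registered.items() if query(doc))


# Hypothetical registered queries, keyed by name.
registered_queries = {
    "errors": term("level", "error"),
    "from-web": term("source", "web"),
    "from-batch": term("source", "batch"),
}

matches = percolate({"level": "error", "source": "web"}, registered_queries)
print(matches)
```

Whether the real implementation behaves this way is exactly the open question in this thread; the sketch only makes the assumed model concrete.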

On Wed, May 1, 2013 at 4:20 PM, Taras Shkvarchuk wrote:

Now that I am re-reading the documentation, I'm starting to doubt my
understanding due to the mentions of round-robin. It might just be that
throughput is increased by adding more nodes. If someone who is sure
doesn't reply today, I'll dig in the source for the answer.

On Thu, May 2, 2013 at 7:34 AM, Kenneth Loafman wrote:

We have a 40-node percolate cluster fronted with a load balancer. The
percolate load is spread across all nodes in the cluster and if the load
increases, we just add more nodes to the cluster. Even though there are
only 3 primaries, the nodes with the primaries do not get more load than
the others.

As to parallelizing the queries in the _percolator, I can't answer that. I
think you'd have to get the ES devs to respond.

...Ken

On Thursday, May 2, 2013 7:29:24 AM UTC-7, Adam Georgiou wrote:

Very interesting, Ken. It sounds like that answers my question in part.
Namely, if your nodes with the primaries are not getting more load than
the nodes without primaries, then it would follow that shard-routing has
little to do with where _percolate executes. That makes sense too, since
the _percolator index is replicated to all nodes, theoretically (as far
as I understand) giving every node the potential to run a _percolate job.

Thanks a lot for clearing that up!

I'm now left with two questions:

  1. Is there any validity to the claim that more replicas equate to
    better percolate performance? As far as I understand there isn't, but I
    have documentation that says otherwise. I'm hoping for confirmation that
    the documentation is wrong.
  2. How does the cluster know which node to execute the _percolate job
    on? Is it always the node that receives the request? Or does _percolate
    use distribution techniques similar to those of _search?

Thanks again for reading,
-Adam

Did you ever get a definitive answer on this?

I am curious as well where the _percolate runs.
The docs state, under "How it Works":

"The percolate API uses the whole number of shards as percolating
processing “engines”, both primaries and replicas. In our above case, if
the test index has 2 shards with 1 replica, 4 shards will round-robin in
handling percolate requests. Increasing (dynamically) the number of
replicas will increase the number of percolating processing “engines” and
thus the percolation power."
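
The arithmetic in that passage can be written out as a small sketch. It assumes "replica" means replicas per primary shard (as with the number_of_replicas index setting); the function name is hypothetical.

```python
def percolate_engines(primary_shards: int, replicas_per_primary: int) -> int:
    """Count all shard copies (primaries plus their replicas) of an index."""
    if primary_shards < 1 or replicas_per_primary < 0:
        raise ValueError("need at least one primary and a non-negative replica count")
    return primary_shards * (1 + replicas_per_primary)


# The docs' example: 2 shards with 1 replica each.
docs_example = percolate_engines(2, 1)
# Raising the replica count to 2 adds one more copy per primary.
more_replicas = percolate_engines(2, 2)
print(docs_example, more_replicas)
```

Under this reading, adding replicas raises the count without reindexing, which fits the docs' "dynamically" remark, whereas the number of primaries is fixed when the index is created.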

They go on to say:

"Note, percolate requests will prefer to be executed locally, and will not
try and round-robin across shards if a shard exists locally on a node that
received a request (for example, from HTTP). It’s important to do some
round-robin in the client code among nodes (in any case its recommended).
If this behavior is not desired, the prefer_local parameter can be set to
false to disable it."

Which I read to mean: The node that takes the request, assuming the shard
exists on that node, will handle the percolation. So they recommend the
client code do the round-robin to the additional nodes/replicas. Or in
other words, if you toss a load balancer in front of the nodes and send
your percolation request to the LB, your percolation requests SHOULD be
distributed across your cluster.
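
A minimal sketch of that client-side round-robin, assuming the client simply cycles through a fixed list of node addresses. The host names are placeholders and `round_robin` is a hypothetical helper; no request is actually sent here.

```python
from itertools import cycle
from typing import Iterator, List


def round_robin(nodes: List[str]) -> Iterator[str]:
    """Yield node addresses in order, wrapping around forever."""
    if not nodes:
        raise ValueError("at least one node address is required")
    return cycle(nodes)


# Placeholder addresses; substitute the nodes of your own cluster.
nodes = [
    "http://es-node-1:9200",
    "http://es-node-2:9200",
    "http://es-node-3:9200",
]

picker = round_robin(nodes)
# Each percolate request would be sent to the next address in turn.
targets = [next(picker) for _ in range(5)]
print(targets)
```

A load balancer in front of the cluster plays the same role, with the rotation happening outside the client.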

If I understand this wrong, please correct me. I am definitely not
an authority, just trying to decipher the docs.
