Percolator Problems

On 1 Feb 2013 at 15:51, Kenneth Loafman kenneth@loafman.com wrote:

Hi,

We have two clusters, one for search and the other for percolation,
both running 0.19.4 on Ubuntu Lucid servers.

The elasticsearch.yml is here: https://gist.github.com/4691735

There are 24 4GB nodes spread across 6 machines, all behind a load
balancer. On a fairly regular basis the nodes get out of step with
each other, sometimes losing entire filter sets on some nodes while
keeping the entire set on others. This causes loss of data,
since we don't catch matching documents if they hit the wrong node.
I've looked at the logs and cannot see any indication of problems.

Each filter set is a set of filters and exclusions, named uniquely.
Percolation matches tell us the index(es) where the data is targeted.
When a filter set is changed, perhaps multiple times per day, the
process is simple: delete all the old filters, then add the new ones
(a very small subset of the total data). I suspected that the
delete followed by the add was somehow being applied in the wrong order,
so I added a flush/refresh after the delete step and after the add step.
We are still encountering the problem. Any ideas?
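For concreteness, the delete-then-add cycle might look like the sketch below, written against the pre-1.0 percolator convention where queries are registered as documents in the special `_percolator` index. The endpoint, the target index name ("notes"), and the filter names are all made up for illustration:

```python
import json
import urllib.request

ES = "http://localhost:9200"  # assumed endpoint; in practice, the load balancer

def percolator_url(name):
    # Pre-1.0 percolator queries live in the _percolator index, with the
    # target index name as the type ("notes" is a made-up example).
    return f"{ES}/_percolator/notes/{name}"

def _send(method, url, body=None):
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    return json.load(urllib.request.urlopen(req))

def replace_filter_set(old_names, new_filters):
    # Delete the old filters, refresh, add the new ones, refresh again --
    # mirroring the delete/flush/add/flush sequence described above.
    for name in old_names:
        _send("DELETE", percolator_url(name))
    _send("POST", f"{ES}/_percolator/_refresh")
    for name, query in new_filters.items():
        _send("PUT", percolator_url(name), {"query": query})
    _send("POST", f"{ES}/_percolator/_refresh")

if __name__ == "__main__":
    replace_filter_set(
        ["fs1-0", "fs1-1"],
        {"fs1-0": {"term": {"body": "elasticsearch"}}},
    )
```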

...Thanks,
...Ken

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

On Fri, Feb 1, 2013 at 4:18 PM, David Pilato david@pilato.fr wrote:

Just a question (I don't have an answer for your concern, as you did not see anything in the logs): do you mean that you host 4 nodes per box? Why don't you start 1 node per box with 16GB RAM?

For your problem, perhaps you should raise the log level to debug to see what's going on when you update the percolator?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


On Sun, Feb 3, 2013 at 8:46 AM, Kenneth Loafman kenneth@loafman.com wrote:

There are 4 nodes per box, each with 4GB, one per processor, in order
to give more processing engines to the task. 4GB of memory is sufficient
to keep all of the filters in memory and provide some caching.
Essentially the percolate process is CPU- and IO-bound, not
memory-bound.

We check return statuses from all ES calls, and those show no problems either.

I don't know much about Java logging, so how do you set the log level,
and what log level should I use?

...Thanks,
...Ken
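For reference, Elasticsearch of that vintage read its log configuration from config/logging.yml on each node (log4j underneath). A sketch of the kind of change David suggests follows; the specific logger names here are illustrative guesses, not verified against 0.19:

```yaml
# config/logging.yml on each node (restart the node to pick it up)
rootLogger: INFO, console, file

logger:
  # Illustrative: crank up percolator- and cluster-related logging.
  index.percolator: DEBUG
  cluster.service: DEBUG
```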


On Tuesday, February 12, 2013 6:26:04 AM UTC-6, Kenneth Loafman wrote:

Coming back to this problem, again...

We upgraded the percolator cluster to 0.19.12 and added one node. In
addition, we upgraded to the latest Sun Java 7 on all nodes. Here's a
bit more detail.

  • The calls we make return no errors, except for an occasional timeout.
  • The logs are clear of errors, at least at the default log level; we
    have no idea how, or by how much, to increase logging.
  • Invariably an index search shows that the filter is in the
    _percolator index, as it should be.
  • When the problem occurs, the percolator starts missing items that
    should match that filter.
  • After we notice the problem, we do a rolling restart of the cluster
    and that clears the problem. NOTE: we do NOT reindex the filters or
    modify them in any way; we just perform a rolling restart and that
    clears it.

We really need to understand how the percolator works, since it's
obvious from its behavior that the filters in the index are not
necessarily representative of the filters being processed. In simple
terms, the percolator itself gets out of sync with the contents of
the _percolator index, and that causes problems.

This out-of-sync bug can happen on all nodes at once, or on just a
few of them. We've seen cases where all of the percolator nodes have
the filter in the index yet fail to percolate until restarted. We've
also seen cases where only a portion of the nodes fail to percolate,
even though the filter is in the _percolator index on all of the nodes.
The first results in total loss of data, the second in only a partial loss.
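A crude way to make that skew visible, assuming you can address each node directly and percolate the same known-matching document against all of them (node names and filter names below are made up):

```python
def find_stale_nodes(matches_by_node, expected_filter):
    """Given the percolator matches each node returned for a document that
    should match `expected_filter`, return the nodes that missed it.

    matches_by_node: dict mapping node name -> list of matched filter names.
    """
    return sorted(
        node
        for node, matches in matches_by_node.items()
        if expected_filter not in matches
    )

# Example: two nodes agree, one has silently stopped applying the filter.
results = {
    "node-1": ["fs1-0", "fs2-3"],
    "node-2": ["fs1-0"],
    "node-3": [],  # out of sync: filter is in the index but not applied
}
print(find_stale_nodes(results, "fs1-0"))  # -> ['node-3']
```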

We have yet to figure out how to reliably reproduce this, but any help
would be greatly appreciated. It's getting to be a real pain. We
will supply any information you need, even access to the cluster
itself (for the ES team) if that would help.

...Thanks,
...Ken


On Friday, March 1, 2013 5:39:34 AM UTC+1, Nizam Sayeed wrote:

I am curious to find out if anyone else is seeing inconsistent results
from heavy-duty percolator usage. We have a test case that can be used to
reproduce this problem. Would love to get some help from the community on
this one. Thanks!

-- Nizam


On Fri, Mar 1, 2013 at 3:22 AM, simonw simon.willnauer@elasticsearch.com wrote:

Can you gist the testcase or open a pull request?

simon


On Friday, March 1, 2013 10:07:21 PM UTC+1, Kenneth Loafman wrote:

I will have to gist the testcase over the weekend with instructions
for setup. It's a small Python program, plus setup for a 2-server,
16-node cluster and the associated data. The data will be available on
S3. The testcase may need to be run several times to get the
percolator to fail. I've seen it fail after 200 percolates, and I've
seen it run for 500,000 before failing. It's the kind of test that
drives one nuts trying to reproduce. The good thing is that once it
fails, it keeps failing until you do a rolling restart on the
cluster.
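Until the gist is up, the shape of that test loop can be sketched as follows; `percolate` stands in for whatever call the harness makes against the cluster, and the 200-call failure point is simulated here, not measured:

```python
def run_until_miss(percolate, expected_filter, max_iterations=500_000):
    """Percolate the same document repeatedly; return the iteration at which
    the expected filter first fails to match, or None if it never fails.
    """
    for i in range(1, max_iterations + 1):
        if expected_filter not in percolate():
            return i
    return None

# Simulated percolator that silently drops the filter after 200 calls,
# mimicking the failure mode described above.
calls = {"n": 0}
def flaky_percolate():
    calls["n"] += 1
    return ["fs1-0"] if calls["n"] <= 200 else []

print(run_until_miss(flaky_percolate, "fs1-0"))  # -> 201
```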

...Thanks,
...Ken


Before you do all this work, can you try the beta and see if it still
happens? I have changed several things in there related to percolation.

simon

On Friday, March 1, 2013 10:07:21 PM UTC+1, Kenneth Loafman wrote:

I will have to gist the testcase over the weekend with instructions
for setup. It's a small Python program and setup for a 2-server
16-node cluster plus associated data. The data will be available on
S3. The testcase may need to be run several times to get the
percolator to fail. I've seen it fail after 200 percolates and I've
seen it run for 500,000 before it fails. It's the kind of test that
drives one nuts trying to reproduce. The good thing is that once it
fails, it will continue to fail until you do a rolling restart on the
cluster.

...Thanks,
...Ken

On Fri, Mar 1, 2013 at 3:22 AM, simonw
<simon.w...@elasticsearch.com <javascript:>> wrote:

can you gist the testcase or open a pull request?

simon

On Friday, March 1, 2013 5:39:34 AM UTC+1, Nizam Sayeed wrote:

I am curious to find out if anyone else is also seeing inconsistent
results from heavy duty percolator usage. We have a test case that can
be

used to reproduce this problem. Would love to get some help from the
community on this one. Thanks!

-- Nizam

On Tuesday, February 12, 2013 6:26:04 AM UTC-6, Kenneth Loafman wrote:

Coming back to this problem, again...

We upgraded the percolator cluster to 0.19.12 and added one node. In
addition, we upgraded to the latest Sun Java 7 on all nodes. Here's a
bit more detail.

  • The calls we make are returning no errors except for an occasional
    timeout.
  • The logs are clear of errors, at least at the default log level, no
    idea how, or how much, to increase logging.
  • Invariably an index search shows that the filter is in the
    _percolator index as it should be.
  • When the problem occurs, the percolator starts missing items that
    should match that filter.
  • After we notice the problem, we do a rolling restart of the cluster
    and that clears the problem. NOTE, we do NOT reindex the filters or
    make any modification of the filters, we just perform a rolling
    restart and that clears it.

We really need to understand how the percolator works, since it's
obvious from it's behavior that the filters in the index are not
necessarily representative of the filters that are being processed.
In simple terms, the percolator itself is getting out of sync with the
contents of the _percolator index and is causing problems.

This out-of-sync bug can happen in all engines at once, or in just a
few of them. We've seen it where all of the percolator nodes have the
filter in the index yet fail to percolate until restarted. We've also
seen it where only a portion of the nodes fail to percolate, yet the
filter is in the _percolator index for all of the nodes. One results
in total loss of data, the other only a partial loss.

We have yet to figure out how to reliably reproduce this, but any help
would be greatly appreciated. It's getting to be a real pain. We
will supply any information you need, even access to the cluster
itself (for the ES team) if that would help.

...Thanks,
...Ken

On Sun, Feb 3, 2013 at 8:46 AM, Kenneth Loafman ken...@loafman.com
wrote:

There are 4 nodes per box, each 4GB, one per processor, in order to
give more processing engines to the task. 4GB of memory is sufficient
to keep all of the filters in memory and provide some caching.
Essentially the percolate process is CPU- and IO-bound, not
memory-bound.

We check return statuses from all ES calls, and those show no problems
either.

I don't know much about Java logging, so how do you set the log level,
and what log level should I use?

...Thanks,
...Ken

On Fri, Feb 1, 2013 at 4:18 PM, David Pilato da...@pilato.fr
wrote:

Just a question (I don't have an answer for your concern as you did
not see anything in logs): do you mean that you host 4 nodes per box?
Why don't you start 1 node per box but with 16GB RAM?

For your problem, perhaps you should modify the log level to debug to
see what's going on when you update the percolator?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



Simon,

Thanks for the tip! We'll try this out and report back if the problem
persists.

-- Nizam

On Saturday, March 2, 2013 12:43:00 AM UTC-6, simonw wrote:

Before you do all this work, can you try the beta and see if it still
happens? I have changed several things in there related to percolation.

simon

On Friday, March 1, 2013 10:07:21 PM UTC+1, Kenneth Loafman wrote:

I will have to gist the testcase over the weekend with instructions
for setup. It's a small Python program and setup for a 2-server
16-node cluster plus associated data. The data will be available on
S3. The testcase may need to be run several times to get the
percolator to fail. I've seen it fail after 200 percolates and I've
seen it run for 500,000 before it fails. It's the kind of test that
drives one nuts trying to reproduce. The good thing is that once it
fails, it will continue to fail until you do a rolling restart on the
cluster.

...Thanks,
...Ken
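
In the meantime, the shape of the testcase is roughly this (a stripped-down sketch, not the actual program; the index name, filter name, and query are invented, and the endpoints follow the 0.19 percolate API):

```python
# Stripped-down sketch of the stress loop: register one filter that
# should always match, then percolate matching documents and record
# any iteration where the percolator stops reporting it.
import json
try:
    from urllib.request import urlopen, Request  # Python 3
except ImportError:
    from urllib2 import urlopen, Request         # Python 2

HOST = "http://localhost:9200"

def send(method, path, body):
    req = Request(HOST + path, json.dumps(body).encode())
    req.get_method = lambda: method  # force PUT/POST as needed
    return json.loads(urlopen(req).read().decode())

def is_miss(resp, name):
    # A "miss" is the failure mode described above: the filter is
    # registered, yet its name is absent from the matches list.
    return name not in resp.get("matches", [])

def run(iterations):
    send("PUT", "/_percolator/testindex/always",
         {"query": {"term": {"field1": "value1"}}})
    misses = []
    for i in range(iterations):
        resp = send("POST", "/testindex/doc/_percolate",
                    {"doc": {"field1": "value1"}})
        if is_miss(resp, "always"):
            misses.append(i)
    return misses
```

As noted, the loop may need hundreds of thousands of iterations before the first miss, but once it misses it keeps missing until a rolling restart.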

On Fri, Mar 1, 2013 at 3:22 AM, simonw
simon.w...@elasticsearch.com wrote:

can you gist the testcase or open a pull request?

simon

On Friday, March 1, 2013 5:39:34 AM UTC+1, Nizam Sayeed wrote:

I am curious to find out if anyone else is also seeing inconsistent
results from heavy-duty percolator usage. We have a test case that can
be used to reproduce this problem. Would love to get some help from
the community on this one. Thanks!

-- Nizam


Hi Simon,

We have been testing the 0.90.0.Beta1 version and it is not breaking
even after 20 million percolations. We've got one thing more to try,
then we may have good news that the problem is fixed. I hope so!

Thanks for the help!

...Ken


Cool stuff, this is good news!

Keep me updated... do you know if there is an issue open for this as well?

simon


No open issue because it would be rejected without a working test case
to reproduce it.

So far the Beta1 version has not shown signs of failure, but on one
occasion it started slowing down severely after about 18 hours. The
two tests after that did not repeat it, so it may have been something
else (they are running in VMs).

One thing I would like to know... is it possible to find out which
node did the actual percolation? I'm seeing signs in production that
seem to indicate that some nodes may possess the filters and others
not. It's just a hunch I'd like to investigate.

...Thanks,
...Ken
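
One way we've considered probing that hunch (a sketch only, under the assumptions that each node's own HTTP port is reachable directly and that the node receiving a percolate request does the matching; host, ports, and names are invented):

```python
# Probe sketch: send the identical percolate request to every node's
# own HTTP port and compare what each one returns. A node whose
# in-memory queries have drifted from the _percolator index should
# stand out as the one returning fewer (or no) matches.
import json
try:
    from urllib.request import urlopen, Request  # Python 3
except ImportError:
    from urllib2 import urlopen, Request         # Python 2

def percolate_url(host, port, index, doc_type="doc"):
    return "http://%s:%d/%s/%s/_percolate" % (host, port, index, doc_type)

def percolate_everywhere(host, ports, index, doc):
    body = json.dumps({"doc": doc}).encode()
    results = {}
    for port in ports:
        url = percolate_url(host, port, index)
        resp = json.loads(urlopen(Request(url, body)).read().decode())
        results[port] = sorted(resp.get("matches", []))
    return results  # unequal values across ports = a node out of sync
```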


After several days of testing we installed 0.90.0.Beta1 on the
production percolator cluster. It's been a week and we are very
pleased. It is no longer dropping filters or exclusions out of its
memory and is now working as expected. We could not get Beta1 to fail
even though we ran 12M percolations on it while actively adding and
deleting filters and exclusion rules. Since the previous versions
would not survive a small fraction of that many percolations and
filter changes, this is a vast improvement.

I will let you know if it breaks again, but I think it's solid now.

Thanks for the help.

...Ken

On Wed, Mar 13, 2013 at 9:26 AM, Kenneth Loafman kenneth@loafman.com wrote:

No open issue because it would be rejected without a working test case
to reproduce it.

So far the Beta1 version has not shown signs of failure, but on one
occasion it started slowing down severely after about 18 hours. The
two tests after that did not repeat the slowdown, so it may have been
something else (they are running in VMs).

One thing I would like to know... is it possible to find out which
node did the actual percolation? I'm seeing signs on production that
seem to indicate that some nodes may possess the filters and others
not. It's just a hunch I'd like to investigate.

...Thanks,
...Ken

On Sat, Mar 9, 2013 at 7:23 AM, simonw simon.willnauer@elasticsearch.com wrote:

Cool stuff, this is good news!

keep me updated... do you know if there is an issue open for this as well?

simon

On Saturday, March 9, 2013 1:54:21 PM UTC+1, Kenneth Loafman wrote:

Hi Simon,

We have been testing the 0.90.0.Beta1 version and it is not breaking
even after 20 million percolations. We've got one thing more to try,
then we may have good news that the problem is fixed. I hope so!

Thanks for the help!

...Ken

On Sat, Mar 2, 2013 at 12:28 PM, Nizam Sayeed ni...@mutualmind.com wrote:

Simon,

Thanks for the tip! We'll try this out and report back if the problem
persists.

-- Nizam

On Saturday, March 2, 2013 12:43:00 AM UTC-6, simonw wrote:

Before you do all this work, can you try the beta and see if it still
happens? I have changed several things in there related to percolation.

simon

On Friday, March 1, 2013 10:07:21 PM UTC+1, Kenneth Loafman wrote:

I will have to gist the testcase over the weekend with instructions
for setup. It's a small Python program and setup for a 2-server
16-node cluster plus associated data. The data will be available on
S3. The testcase may need to be run several times to get the
percolator to fail. I've seen it fail after 200 percolates and I've
seen it run for 500,000 before it fails. It's the kind of test that
drives one nuts trying to reproduce. The good thing is that once it
fails, it will continue to fail until you do a rolling restart on the
cluster.
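A stress test of this shape needs an oracle to compare the percolator's
answers against. Below is a minimal sketch of that comparison logic, with a
"filter set" reduced to include/exclude term sets over one field; it is
purely illustrative (the names and structure are hypothetical), not the
actual testcase:

```python
# Hypothetical oracle a percolator stress test could compare cluster
# responses against -- not the actual program from this thread.

def expected_matches(filter_sets, doc_terms):
    """Return the names of filter sets that should match a document.

    filter_sets: {name: {"include": set_of_terms, "exclude": set_of_terms}}
    doc_terms:   set of terms extracted from the document
    """
    matches = []
    for name, fs in filter_sets.items():
        # a set matches if any include term is present and no exclude term is
        if fs["include"] & doc_terms and not (fs["exclude"] & doc_terms):
            matches.append(name)
    return sorted(matches)

filters = {
    "acme-brand":  {"include": {"acme"}, "exclude": {"spam"}},
    "widget-news": {"include": {"widget", "gadget"}, "exclude": set()},
}
print(expected_matches(filters, {"acme", "widget"}))
# -> ['acme-brand', 'widget-news']
```

A stress loop would percolate the same document against the cluster and flag
any divergence from this list as an instance of the out-of-sync bug.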

...Thanks,
...Ken

On Fri, Mar 1, 2013 at 3:22 AM, simonw simon.w...@elasticsearch.com wrote:

can you gist the testcase or open a pull request?

simon

On Friday, March 1, 2013 5:39:34 AM UTC+1, Nizam Sayeed wrote:

I am curious to find out if anyone else is also seeing inconsistent
results from heavy-duty percolator usage. We have a test case that can
be used to reproduce this problem. Would love to get some help from
the community on this one. Thanks!

-- Nizam

On Tuesday, February 12, 2013 6:26:04 AM UTC-6, Kenneth Loafman
wrote:

Coming back to this problem, again...

We upgraded the percolator cluster to 0.19.12 and added one node. In
addition, we upgraded to the latest Sun Java 7 on all nodes. Here's a
bit more detail.

  • The calls we make are returning no errors except for an occasional
    timeout.
  • The logs are clear of errors, at least at the default log level; no
    idea how, or how much, to increase logging.
  • Invariably an index search shows that the filter is in the
    _percolator index as it should be.
  • When the problem occurs, the percolator starts missing items that
    should match that filter.
  • After we notice the problem, we do a rolling restart of the cluster
    and that clears the problem. NOTE, we do NOT reindex the filters or
    make any modification of the filters; we just perform a rolling
    restart and that clears it.
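The check in the third bullet rests on the percolator layout of that era:
queries are registered as documents in the special _percolator index and
documents are then percolated against a target index. A minimal sketch of
the two request payloads, following the 0.19-style REST paths (verify
against your version's docs; "myindex" and the filter name are
hypothetical):

```python
import json

# Sketch of the two requests involved in the 0.19-era percolator API.
# Index and filter names are hypothetical; check paths against your docs.

def register_filter(index, name, query):
    # Registered queries live in the special _percolator index:
    # PUT /_percolator/{index}/{name}
    return "/_percolator/%s/%s" % (index, name), json.dumps({"query": query})

def percolate_doc(index, doc_type, doc):
    # Matching happens per target index: GET /{index}/{type}/_percolate
    return "/%s/%s/_percolate" % (index, doc_type), json.dumps({"doc": doc})

reg_path, reg_body = register_filter("myindex", "acme-brand",
                                     {"term": {"body": "acme"}})
perc_path, perc_body = percolate_doc("myindex", "doc", {"body": "acme rocket"})
print(reg_path, perc_path)
```

Searching _percolator for the filter name (as in the bullet) only proves the
query document is stored, not that every node's in-memory percolator has
loaded it, which is exactly the gap described below.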

We really need to understand how the percolator works, since it's
obvious from its behavior that the filters in the index are not
necessarily representative of the filters that are being processed.
In simple terms, the percolator itself is getting out of sync with the
contents of the _percolator index and is causing problems.

This out-of-sync bug can happen in all engines at once, or in just a
few of them. We've seen it where all of the percolator nodes have the
filter in the index yet fail to percolate until restarted. We've also
seen it where only a portion of the nodes fail to percolate, yet the
filter is in the _percolator index for all of the nodes. One results
in total loss of data, the other in only a partial loss.

We have yet to figure out how to reliably reproduce this, but any help
would be greatly appreciated. It's getting to be a real pain. We
will supply any information you need, even access to the cluster
itself (for the ES team) if that would help.

...Thanks,
...Ken

On Sun, Feb 3, 2013 at 8:46 AM, Kenneth Loafman ken...@loafman.com wrote:

There are 4 nodes per box, each 4GB, one per processor, in order to
give more processing engines to the task. 4GB memory is sufficient to
keep all of the filters in memory and provide some caching.
Essentially the percolate process is CPU- and IO-bound, not
memory-bound.

We check return statuses from all ES calls, and those show no problems
either.

I don't know much about Java logging, so how do you set the log level,
and what log level should I use?

...Thanks,
...Ken
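For what it's worth, Elasticsearch of that era read log levels from
config/logging.yml (log4j configured via YAML). A sketch of raising the
relevant loggers to DEBUG, as David suggests; the exact logger names here
are an assumption, so check them against the logging.yml shipped with your
version:

```yaml
# config/logging.yml (0.19-era layout; logger names below are a guess)
rootLogger: INFO, console, file

logger:
  # raise the index machinery (which hosts the percolator) to debug
  index: DEBUG
  action: DEBUG
```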

On Fri, Feb 1, 2013 at 4:18 PM, David Pilato da...@pilato.fr wrote:


Thank you for reporting this & for pinging back to say it's solved!

don't hesitate if you have any other problems

simon

On Thursday, March 21, 2013 2:12:57 PM UTC+1, Kenneth Loafman wrote:
