Cluster health issues with 0.9.0-SNAPSHOT


(Stephen-2) #1

Hey guys,

I've been doing some fairly ornate integration testing with my app and
ES, and have hit an interesting snag. What I'm doing is creating
three nodes and shoving a bunch of data into them, but with a twist --
some of my indexes are empty (as in, the index exists but has no
documents in it yet).

This state is necessary for my application since people can create new
projects without putting any data into them and still expect their
global searches to work. If I don't create empty indexes (as in,
client.admin().indices().prepareCreate(absoluteName).execute().actionGet();),
then all my searches fail due to shard failures.

So, the side effect of having empty indexes is that if one of my nodes
goes down, the cluster status goes yellow and stays that way. It
seems that the indexes that were NOT empty get replicated properly,
but the ones that ARE empty get ignored. This also causes very
strange recovery failures when I shut down a second node, which I'm
hoping is related to the emptiness problem.

Here's what my work directory looks like with all three nodes happy:

starkey-imac:nodes stephen$ du -sh *
1.1M 0
1.2M 1
1.1M 2

starkey-imac:~ stephen$ curl -XGET 'http://192.168.0.5:9200/_cluster/health'
{"status":"green","timed_out":false,"active_primary_shards":220,"active_shards":440,"relocating_shards":0}

And when I kill one of them:

starkey-imac:nodes stephen$ du -sh *
1.5M 0
1.2M 1
0B 2

starkey-imac:~ stephen$ curl -XGET 'http://192.168.0.5:9200/_cluster/health'
{"status":"yellow","timed_out":false,"active_primary_shards":220,"active_shards":359,"relocating_shards":0}
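The two health responses can be compared mechanically to see how many shard copies went missing. The following is only a sketch: `HealthDelta` and its methods are hypothetical names, and the field lookup uses a simple regex rather than a real JSON parser, which is good enough for the flat response shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HealthDelta {
    // Pulls one integer field out of a flat _cluster/health JSON response.
    static int intField(String json, String name) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*(\\d+)");
        Matcher m = p.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("missing field: " + name);
        }
        return Integer.parseInt(m.group(1));
    }

    // Number of shard copies that were active while green but are not active now.
    static int missingCopies(String greenJson, String currentJson) {
        return intField(greenJson, "active_shards") - intField(currentJson, "active_shards");
    }

    public static void main(String[] args) {
        String green = "{\"status\":\"green\",\"timed_out\":false,"
                + "\"active_primary_shards\":220,\"active_shards\":440,\"relocating_shards\":0}";
        String yellow = "{\"status\":\"yellow\",\"timed_out\":false,"
                + "\"active_primary_shards\":220,\"active_shards\":359,\"relocating_shards\":0}";
        // 440 - 359: shard copies that have not been re-allocated
        System.out.println(missingCopies(green, yellow)); // prints 81
    }
}
```

With the numbers posted here, all 220 primaries are still active, so the 81 missing copies are all replicas that were never re-allocated.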

The problem is, I'm kind of paranoid -- I refuse to allow my
application to write into the cluster unless it's green. However,
this seriously puts a crimp in my style, since the cluster never goes
green after this happens.
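A "write only when green" gate of the kind described here might look roughly like the sketch below. It is not taken from the application in question: `GreenGate` is a hypothetical name, the status comes from an injected `Supplier<String>` (in a real application that would wrap a call to `_cluster/health`), and the timeout and poll interval are arbitrary illustrative values.

```java
import java.util.function.Supplier;

class GreenGate {
    // Polls statusSource until it reports "green" or the timeout elapses.
    // Returns true if green was observed, false on timeout or interruption.
    static boolean awaitGreen(Supplier<String> statusSource, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if ("green".equals(statusSource.get())) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in status source that always reports yellow, as in the stuck cluster.
        boolean ok = awaitGreen(() -> "yellow", 200, 50);
        System.out.println(ok ? "writes allowed" : "cluster not green, refusing to write");
    }
}
```

With a bounded wait like this, a cluster that stays yellow shows up as a timeout instead of blocking the writer forever.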

Any thoughts?

Thanks in advance,

Stephen.


(Shay Banon) #2

I ran a simple test where I created a single empty index on a three-node
cluster, and the cluster health was green. Then I shut down one node, and
cluster health was still green. Then I shut down the second node (which
leaves one node) and health was yellow (as expected).
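The "as expected" part follows from replica placement: a replica is never allocated on the same node as its primary, so each copy of a shard needs its own node. Here is a minimal sketch of that rule, assuming one replica per shard (consistent with the 220 primary and 440 active shards reported in the first post). `ExpectedHealth` is a hypothetical helper, not part of elasticsearch, and it ignores the red state where primaries are unassigned.

```java
class ExpectedHealth {
    // Each shard has 1 primary plus replicasPerShard replicas, and no two copies
    // of the same shard share a node, so green needs at least that many nodes.
    static String expected(int nodes, int replicasPerShard) {
        if (nodes < 1 || replicasPerShard < 0) {
            throw new IllegalArgumentException("need at least one node and a non-negative replica count");
        }
        return nodes >= replicasPerShard + 1 ? "green" : "yellow";
    }

    public static void main(String[] args) {
        // The three steps of the test: 3 nodes, then 2, then 1.
        for (int nodes = 3; nodes >= 1; nodes--) {
            System.out.println(nodes + " node(s): " + expected(nodes, 1));
        }
    }
}
```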

Is there a chance you can recreate it?

-shay.banon


(Stephen-2) #3

Unfortunately, the only thing I can do to reproduce it with simple
data requires a lot of trial and error. That and more often than not
I run out of memory if I play with it too much. I would need a much
more powerful machine to get it to work right.

All I can say is it seems to happen reliably when I have 3 nodes, and
alternate between turning one of them off and on again in quick
succession. Perhaps the problem is I don't leave time to allow them
to replicate back. But, I could swear I waited for all the indexes to
turn green again before killing the next node.

I'll play with it more.

Stephen.


(Stephen-2) #4

Okay,

So I've spent all morning writing a fairly involved test class which
starts up its own nodes. I then ran a series of test cases that seem
to reproduce the health issue fairly regularly. The methodology I
used to execute the tests is as follows:

  1. Vary the number of nodes from 2 to 4
  2. Vary the number of indexes between 5 and 20
  3. Vary the amount of data from 0% (nodata) to 25% (somedata) to 100%
    (alldata)
  4. When the number of nodes is greater than 2 and all indexes are
    GREEN, kill the LAST node
  5. Before declaring the test to be a failure, wait for the log to stop
    moving (after waiting 15-30 seconds)
  6. Use default values for everything -- no special configuration of
    indexes, replication, shard counts, etc.
  7. In the case where the indexes never turn green, re-run the test a
    couple of times
  8. Allocate -Xmx1024m to the entire test case

Assumptions:

  1. A cluster should always strive toward GREEN when there is more than
    one node
  2. If a node is lost, the cluster should always rebalance until it
    becomes GREEN again (assuming more than one node)

Results:

In many cases, ES fails to return to a fully replicated (GREEN) state. It
has the most success when there are 5 indexes, but even then it is not
perfect. As the number of nodes increases, the odds that the cluster
fails to recover after a node failure increase. Likewise, as the number
of indexes increases, the odds of failure increase.

Test Class: http://www.sourcepod.com/puyttw27-3473
Detailed Results: http://spreadsheets.google.com/ccc?key=0AoyOQtG1DAe4dDhrVXpxMkZOMkYwQVFMWXdBN01lNXc&hl=en&pli=1#gid=0

The only dependencies I can see are the ones necessary to build
elasticsearch.

Usage: com.pegby.ElasticSearchTester [# nodes to start] [# indexes to
create] [nodata, somedata, or alldata]
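The test matrix described in this post can be enumerated as in the sketch below. This is not the linked test class. The thread only gives the index range (5 to 20), so the three index counts used here are assumed sample points, and `TestMatrix` is a hypothetical name. The argument order follows the usage line above.

```java
import java.util.ArrayList;
import java.util.List;

class TestMatrix {
    // Builds one argument string per combination, in the order given by the
    // usage line: [# nodes] [# indexes] [data level].
    static List<String> combinations(int[] nodeCounts, int[] indexCounts, String[] dataLevels) {
        List<String> runs = new ArrayList<>();
        for (int nodes : nodeCounts) {
            for (int indexes : indexCounts) {
                for (String level : dataLevels) {
                    runs.add(nodes + " " + indexes + " " + level);
                }
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        int[] nodeCounts = {2, 3, 4};
        int[] indexCounts = {5, 10, 20}; // assumed sample points within the stated 5-20 range
        String[] dataLevels = {"nodata", "somedata", "alldata"};
        for (String run : combinations(nodeCounts, indexCounts, dataLevels)) {
            System.out.println("com.pegby.ElasticSearchTester " + run);
        }
    }
}
```

With these assumed values the matrix has 27 runs, each of which would be repeated a couple of times when the indexes fail to turn green (item 7 in the methodology).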


(Shay Banon) #5

Hi,

Thanks for the test. I can't access your spreadsheet, but I managed to
recreate it locally (happens very sporadically). Will look into this and
update you once I fix it.

-shay.banon


(Shay Banon) #6

Hi,

I just pushed a fix for this; I can't get it to happen with your test
anymore. If you can verify it, it would be great. I will write some similar
built-in tests for this that will run as part of the elasticsearch test suite.

-shay.banon


(Stephen-2) #7

... is there a chance anybody's looked at this?


(Shay Banon) #8

Maybe you missed my previous email from Jul 18; here it is again :)

"
Hi,

I just pushed a fix for this, I can't get it to happen with your test
anymore. If you can verify it, it would be great. I will write some built in
similar tests for this that will run as part of the test suite of
elasticsearch.
"

-shay.banon


(Stephen-2) #9

Awesome! Sorry I missed your message. I'll take a look at it. Glad
I was able to be of assistance.

I hope my continuous nagging isn't too much of a bother. :) We're
evaluating ES hard-core for use in a bunch of critical systems and
can't wait for it to hit the maturity level where we can use it. I'm
confident that your approach is the right one -- hope I can contribute
if only with test cases.

Regards,

Stephen.


(Shay Banon) #10

No bother at all, this is exactly the kind of thing that will help
elasticsearch mature. 0.9 is shaping up to be a really good release!

-shay.banon


(Shay Banon) #11

One more thing: as I want to release 0.9 soon (a matter of days), it would
really help if you can verify on your end that it's fixed. If there are still
problems, I would love to try and see if they can be tackled for 0.9.

-shay.banon


(Stephen-2) #12

Hey Shay,

I just re-ran my automated tests and executed a bunch of manual tests
and everything seems great on the replication and recovery front. I
found a minor issue where documents get duplicated when recovered from
the filesystem gateway, and will probably do some cross-computer
testing as well (thus far have been focusing on same-machine stuff).

Can't wait for the release!

Stephen.


(Shay Banon) #13

Hi,

Great that this works. Are you sure that documents get duplicated when
recovering from the gateway? If this happens, can you recreate it?

-shay.banon

