Disappearing Shards


(Jacob Perkins) #1

I'm experiencing what I believe may be a bug in the elasticsearch
recovery process. That is, when a node goes down (either from load or
otherwise) and I bring it back up, one or more shards go missing. From
the logs it seems that when the data node is brought back up it is
given an empty shard thereby overwriting the shard it previously had.
I'm using elasticsearch-0.13.0 with replicas == 1 on all indices.

--jacob


(Shay Banon) #2

How many nodes are in the cluster? Can you post the logs somewhere so I can
have a look at what's going on (with notes on the time it happened)?

On Mon, Dec 20, 2010 at 7:45 PM, Jacob Perkins jacob.a.perkins@gmail.com wrote:

I'm experiencing what I believe may be a bug in the elasticsearch
recovery process. That is, when a node goes down (either from load or
otherwise) and I bring it back up, one or more shards go missing. From
the logs it seems that when the data node is brought back up it is
given an empty shard thereby overwriting the shard it previously had.
I'm using elasticsearch-0.13.0 with replicas == 1 on all indices.

--jacob


(Jacob Perkins) #3

Happened on Friday last week and I can't seem to find the right set of
log files. When it happens again I'll send you more detailed notes +
logs. Thanks,

--jacob

On Dec 20, 12:22 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

How many nodes are in the cluster? Can you post the logs somewhere so I can
have a look at what's going on (with notes on the time it happened)?


(Shay Banon) #4

Great, thanks! The local gateway allocation has been improved in the upcoming
0.14 for some cases where this might happen, but they should be very rare,
so I wondered if you hit that...

On Mon, Dec 20, 2010 at 10:15 PM, Jacob Perkins jacob.a.perkins@gmail.com wrote:

Happened on Friday last week and I can't seem to find the right set of
log files. When it happens again I'll send you more detailed notes +
logs. Thanks,

--jacob


(Jacob Perkins) #5

On Dec 20, 11:51 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Great, thanks! The local gateway allocation has been improved in the upcoming
0.14 for some cases where this might happen, but they should be very rare,
so I wondered if you hit that...

So, I packaged all the logs and made them available here:

http://infochimps-test.s3.amazonaws.com/elasticsearch?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293214331&Signature=razCSXlxHjYUXXic985pMTmSLMQ%3D

I'm not sure what I should be looking for. (After adding more machines
to the cluster the entire thing went down, so ...)

--jacob
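
For anyone inspecting a link like the one above: a query-string signed S3 URL carries its own expiry as a Unix timestamp in `Expires`, and its path names the object key the signature was made for. The following is a minimal Python sketch that decodes those fields. The helper name `describe_signed_url` is illustrative and not from this thread, and the code only parses the URL text; it does not contact S3 or check the signature.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs


def describe_signed_url(url):
    """Return the host, object key, and expiry time encoded in a signed URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    expires = int(query["Expires"][0])  # seconds since the Unix epoch
    return {
        "host": parts.netloc,
        "key": parts.path.lstrip("/"),
        "expires_utc": datetime.fromtimestamp(expires, tz=timezone.utc),
    }


# The URL posted above, split across lines for readability.
info = describe_signed_url(
    "http://infochimps-test.s3.amazonaws.com/elasticsearch"
    "?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293214331"
    "&Signature=razCSXlxHjYUXXic985pMTmSLMQ%3D"
)
print(info["key"], info["expires_utc"].isoformat())
```

Here the decoded key is the bare `elasticsearch` prefix rather than a single archive file.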


(Shay Banon) #6

I can't download from the link (it comes up as broken). What do you mean by
"after adding more nodes the entire thing went down"?

On Tue, Dec 21, 2010 at 8:23 PM, Jacob Perkins jacob.a.perkins@gmail.com wrote:

So, I packaged all the logs and made them available here:

http://infochimps-test.s3.amazonaws.com/elasticsearch?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293214331&Signature=razCSXlxHjYUXXic985pMTmSLMQ%3D

I'm not sure what I should be looking for. (After adding more machines
to the cluster the entire thing went down, so ...)

--jacob


(Jacob Perkins) #7

On Dec 21, 12:26 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

I can't download from the link (it comes up as broken). What do you mean by
"after adding more nodes the entire thing went down"?

Evidently I can't make a link for the top-level directory, so here's a list
(one per node that was up at the time):

http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode9.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=sTWEhQkdbYh3ovfh6ZNW9o%2BCcG4%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode7.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=TMzpoCum6XQCq8m/NlFxvhM7uIs%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode6.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=13Z06AOis6qF16/YzXB6Bpp5dK0%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode5.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=GTYuONzYQbZskJEHKz8BKNL4ECY%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode4.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=WGmkP6uT%2B7Zm5JWJw8/2vjWIbqI%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode3.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=DLOz2LCt/X0YEi645ZBCoq0fTfA%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode2.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=1VpSXno41xjggjrRwT1/IGfbrO0%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode15.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=%2BaqrNPOwIYGyqWmIMQd8dYPDhvs%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode14.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=ygg/L/hlfVI/5VMebsaA8qt1BDc%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode13.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=lUB8a9vnkbnFrQyKT%2B6K0hSPmR8%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode11.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=vj%2BpZLMZ3WtHaGOt3yZ4qDNHbFI%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode10.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=GMHNiWCrNa2sNQGtkv7o4vcBSFc%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode1.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=WjnkN4tiaaMBx6z%2BUopGsnNBlTU%3D
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_logs_esnode0.tar.bz2?AWSAccessKeyId=02S6Y1EFA7ZZ7KCZH3G2&Expires=1293218490&Signature=eWHVte8CohqDF3ZVWOExxVyGSf4%3D

After adding more machines a (repair?) started and barraged the new
machines with data. Then the cluster went yellow and shortly after
everything stopped responding.

--jacob
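
One way to capture the sequence described above (yellow, then unresponsive) is to record cluster health before and after a change and compare shard counts. In Elasticsearch, yellow means all primary shards are allocated but at least one replica is not. The sketch below is an illustration under stated assumptions, not something taken from this thread: it assumes a node reachable at `http://localhost:9200` and a `_cluster/health` response containing `status`, `active_shards`, and `unassigned_shards`, which may differ in 0.13. The snapshot values are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url="http://localhost:9200", timeout=5):
    """Fetch cluster health as a dict (assumed endpoint: /_cluster/health)."""
    with urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def shards_lost(before, after):
    """Count active shards present in `before` that are no longer active in `after`."""
    return max(0, before["active_shards"] - after["active_shards"])


def summarize(health):
    """Format the fields of interest as a single log-friendly line."""
    return (f"status={health['status']} "
            f"active={health['active_shards']} "
            f"unassigned={health['unassigned_shards']}")


# Hypothetical snapshots standing in for two fetch_health() calls,
# taken before and after a node restart.
before = {"status": "green", "active_shards": 20, "unassigned_shards": 0}
after = {"status": "yellow", "active_shards": 18, "unassigned_shards": 2}
print(summarize(after), "lost:", shards_lost(before, after))
```

Logging one such line with a timestamp at a fixed interval would give the "notes on the time it happened" requested earlier in the thread.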


(Jacob Perkins) #10

On Dec 21, 1:28 pm, Jacob Perkins jacob.a.perk...@gmail.com wrote:

Evidently I can't make a link for the top-level directory, so here's a list
(one per node that was up at the time):

http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...
http://infochimps-test.s3.amazonaws.com/elasticsearch/elasticsearch_l...

After adding more machines a (repair?) started and barraged the new
machines with data. Then the cluster went yellow and shortly after
everything stopped responding.

--jacob

New nodes launched at 2010-12-20 12:25 CST

--jacob

