Recovery issues on master


(ppearcy) #1

Hey,
Running latest version:
number: "0.13.0-SNAPSHOT"
date: "2010-11-07T07:05:01"

Was able to build content up without a problem. Afterwards, I shut down
the cluster to rename it and move the gateway location. On subsequent
startup, the cluster eventually got into the yellow state; however,
not all shards were recovered.

Here is the cluster health that shows yellow:
https://gist.github.com/668233

Here is the cluster state that shows some indexes not yet recovered:
https://gist.github.com/668241
Apologies for the ugly-looking gists; I didn't have pretty=true when I
saved them.

As you can see, in the cluster state, there are indexes with:
description: "index not recovered"

I have gateway logging turned up and these shards show as being
successfully recovered. Let me know if you need any more details.

As a side note, local recovery seems much slower, and a large chunk of
time seems to be missing from the recovery-completed timings that are
listed. Could this be the time used for the new checksumming?

Thanks,
Paul


(Shay Banon) #2

Hi,

Do you still have the logs? Regarding the timing, checksumming is done on
"write", so it will not affect recovery time. Can you also point to the
settings you have? Specifically, how many machines do you have, and the
settings for the nodes.

-shay.banon



(ppearcy) #3

Hey,
My log is available here:
http://dl.dropbox.com/u/12095883/dev-0.13.0-SNAPSHOT.log

It shows a few recovery attempts, all exhibiting the same condition
and timings. Here are the settings that I am running:
https://gist.github.com/668300

The current testing I am doing is just with a single node.

Let me know if you need anything else.

Thanks,
Paul



(ppearcy) #4

An interesting observation: I'm in the yellow state and there is some
really suspect CPU usage from the elasticsearch service. It
fluctuates, with all 24 cores running at ~15% and sometimes one core
pegged at 100%. There is no indexing or searching occurring.

Thanks



(Shay Banon) #5

Hi,

First, it seems like it tries to connect to 10.2.20.164 (which I think is
running an older version than 0.13), which might cause problems. I have
changed the low-level serialization of IP addresses to better handle IPv6
addresses. It would be great if you could separate the two and run the
test.

Second, the reason that you see a yellow state is that you have a
single node. A shard and its replica will never be allocated on the same
node (for obvious reasons), so you will never reach a green status with all
the shards and replicas allocated.
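Shay's point about the single-node yellow state can be sketched as a tiny decision rule (illustrative only; the function name and inputs are hypothetical, not Elasticsearch's actual implementation):

```python
# Hypothetical sketch of how the health color falls out of shard allocation.
# On a single node, a replica can never be allocated alongside its primary,
# so any index with replicas > 0 leaves the cluster stuck at yellow.

def health_color(unassigned_primaries: int, unassigned_replicas: int) -> str:
    if unassigned_primaries > 0:
        return "red"      # some primary shards are not active
    if unassigned_replicas > 0:
        return "yellow"   # all primaries active, but replicas unallocated
    return "green"        # every shard and replica is allocated

# One node, an index with 5 shards x 1 replica: primaries allocate,
# the 5 replicas cannot.
print(health_color(unassigned_primaries=0, unassigned_replicas=5))  # prints "yellow"
```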

Regarding the recovery, it seems to work well, and it does reuse the local
storage, so it should be quick.

-shay.banon



(ppearcy) #6

Yes, there is another cluster running, but with a different name
(dev-0.12.0 vs dev-0.13.0-SNAPSHOT). That is why you see those errors,
but that's not a problem, is it? I can't shut down the cluster others
are using for development while I try to bring up the replacement.

I'd expect the cluster to be in the red (not yellow) state when there
are indexes that are not recovered. Please look at the cluster state
link I sent and you will see there are many stating "Index not
recovered". Am I missing something here?

The recovery times are completely different from 0.12. There are huge
time gaps between the shard recoveries in some cases. I can point out
specific ones in the logs if that helps.

I am on IRC, hop on if you think it'd help to have better back and
forth.

Thanks,
Paul



(Shay Banon) #7

On Mon, Nov 8, 2010 at 11:52 PM, Paul ppearcy@gmail.com wrote:

Yes, there is another cluster running, but with a different name
(dev-0.12.0 vs dev-0.13.0-SNAPSHOT). That is why you see those errors,
but that's not a problem, is it? I can't shutdown the cluster others
are using for development while I try to bring up the replacement.

Since the serialization changed, it might pose a problem. You don't have to
shut down other clusters, but you do need to separate them. If you use
unicast (which you seem to use), just don't list that other cluster's IP
address. If you use multicast, use a different IP address.

The idea is that the cluster name will provide isolation between similar
versions, but while the protocol is evolving, it's not enough if the protocol
changes between versions, and you do need to isolate them at the network /
discovery settings level.
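For illustration, the isolation Shay suggests might look like this in elasticsearch.yml (the host address shown is hypothetical, and the setting names are from the zen discovery module of that era, so double-check them against your version):

```yaml
cluster.name: dev-0.13.0-SNAPSHOT
# Disable multicast so this node cannot stumble onto the 0.12 cluster
discovery.zen.ping.multicast.enabled: false
# List only the nodes of this cluster; omit the 0.12 node (10.2.20.164).
# 10.2.20.165 is a placeholder for one of your 0.13 nodes.
discovery.zen.ping.unicast.hosts: ["10.2.20.165"]
```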

I'd expect the cluster to be in the red (not yellow) state when there
are indexes that are not recovered. Please look at the cluster state
link I sent and you will see there are many stating "Index not
recovered". Am I missing something here?

Good point. Currently, blocked indices do not affect the cluster
health... Not sure if they should or not. Maybe they should be listed
in cluster state as well?

Regarding the blocks, when did you run the cluster health? Based on the log
(as far as I can tell), one blocked index that I checked was recovered, in
all the runs that I can see.

The recovery times are completely different from 0.12. There are huge
time gaps between the shard recoveries in some cases. I can point out
specific ones in the logs if that helps.

Nothing changed there between 0.12 and master. Can you try and isolate from
the other cluster and run a clean test (with clean logs)? It will be simpler
to try and understand what is going on since you have so many indices and
the current log has several restarts.

I am on IRC, hop on if you think it'd help to have better back and
forth.

Not fully online. Am hacking on my iPhone (even just pushed a fix to not
list empty index-level blocks)... :)



(Shay Banon) #8

One more thing: Will be on IRC tomorrow morning... (I also have IRC on the
iPhone, but for some reason it fails to start... too much hacking on it...)



(ppearcy) #9

Cool, will remove that from the unicast config.

The shards marked as "index not recovered" never recovered after
waiting a couple of hours. I guess I don't understand what is meant by
a "blocked index". Is this different from not recovered? If an index
is blocked, that implies to me the cluster is not healthy, but not
understanding what a block means, I'm probably off base :)

The cluster health and cluster state were taken at the same time.
Some indexes claimed to be recovered in the logs were still blocked.
Not sure what the disconnect is there.

Quite impressive typing from an iPhone :)

Thanks for the recommendations. Will try them out and catch up with you
on IRC tomorrow to let you know where things are at. Will probably post
more details later tonight, as well.



(Shay Banon) #10

On Tue, Nov 9, 2010 at 12:36 AM, Paul ppearcy@gmail.com wrote:

Cool, will remove that from the unicast config.

The shards marked as "index not recovered", never recovered after
waiting a couple of hours. I guess I don't understand what it meant by
a "blocked index". Is this different from not recovered? If an index
is blocked, that implies to me the cluster is not healthy, but not
understanding what block means, I'm probably off base :)

The "index not recovered" block is added to each index created from the
gateway for which not all primary shards have been recovered yet. Once all
primary shards for that index have been recovered, this block is
removed. The idea is that the cluster will be in a red health state while
an index has been added but not all of its primary shards have been
recovered yet. Maybe something is off there and for some reason the block is
not removed even though the shards are active, though looking at the code I
don't really see how this can happen...
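The block semantics Shay describes could be checked with a small script like the one below (the JSON shape is a hypothetical simplification of the cluster-state response from the gists, not the exact API format, and `blocked_indices` is an illustrative helper, not part of any API):

```python
# Sketch: given a parsed cluster-state response where each index may carry
# a list of block descriptions, list the indices still blocked with
# "index not recovered". The dict layout here is assumed, not verbatim.

def blocked_indices(cluster_state: dict) -> list:
    blocked = []
    for name, index in cluster_state.get("indices", {}).items():
        for block in index.get("blocks", []):
            if block.get("description") == "index not recovered":
                blocked.append(name)
    return blocked

state = {
    "indices": {
        "logs-1": {"blocks": [{"description": "index not recovered"}]},
        "logs-2": {"blocks": []},
    }
}
print(blocked_indices(state))  # prints "['logs-1']"
```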

The cluster health and cluster state where taken at the same time.
Some indexes claimed to be recovered in the logs were still blocked.
Not sure what the disconnect is there.

Strange. Let's give this clean run a go. I want to get the different-cluster
communication out of the way, as it's basically in a quantum state (hard to
predict what its effects are, even when peeking inside the box...).

Quite impressive typing from an iPhone :)

It would be impressive if it did not take that long :).

Thanks for the recommendations. Will try them out and let you know and
catch up with you on IRC tomorrow to let you know where things are at.
Will probably post more details later tonight, as well.

Great, catch you tomorrow.



(ppearcy) #11

OK, I updated to remove the non-0.13 server from unicast. No more log
exceptions for connectivity, so we're good on that front.

I cleared my gateway and re-created my indexes fresh. I then restarted
the cluster with all of the indexes empty and hit the same issue.

So I then created a simple test: create a bunch of empty indexes with no
mappings. That was enough to reproduce. Here is what I did to add all the
indexes: https://gist.github.com/668474

Here is what the cluster state looked like: https://gist.github.com/668481

I think there is a bug around having more than 50 or so shards.
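To put a number on that threshold: with the 0.13-era defaults of 5 primary shards and 1 replica per index (an assumption here; the actual settings are in the gists above), a handful of empty indexes already crosses 50 shards:

```python
# Hedged sketch: shard counts implied by a batch of freshly created
# empty indexes, assuming defaults of 5 primaries and 1 replica per
# index (check the settings gist for the real values).

def total_shards(num_indexes, primaries=5, replicas=1):
    """Shards the cluster must track: primaries plus their replicas."""
    return num_indexes * primaries * (1 + replicas)

for n in (5, 6, 10):
    print(n, "indexes ->", total_shards(n), "shards")
# 6 indexes -> 60 shards, so the repro crosses ~50 almost immediately.
```

On a single node the replicas stay unassigned (hence yellow), so only the primaries are actually allocated; 10 indexes is 50 primary shards.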

Let me know if there are any questions or you need anything else.

Thanks,
Paul



(ppearcy) #12

For those following this thread, Kimchy pushed some updates to master
this morning that addressed this.

Thanks!!!



(system) #13