EC2 Cluster Failure


(James Cook) #1

I'm experiencing a new problem with EC2 clustering using 0.16.0.

I am starting two nodes on EC2, and during the discovery process I see this
log:
https://gist.github.com/1019038

My two nodes do not discover each other and fail to cluster. Ports are open,
and the configuration is identical between the nodes:

 cloud.aws.access_key : AKIAJWQRTNTMFXIMX3WA
 cloud.aws.secret_key : <HIDDEN>
 cluster.name : elasticsearch-dev
 discovery.type : ec2
 discovery.zen.ping_timeout : 30s
 gateway.s3.bucket : ppkc-es-gateway-dev
 gateway.type : s3
 http.enabled : true
 http.port : 9311
 index.mapping._id.indexed : true
 index.store.type : niofs
 name : Sikorsky
 network.host : 0.0.0.0
 node.data : true
 path.data : /var/local/es/data
 transport.tcp.port : 9310

(James Cook) #2

A little background, if needed.

[Sikorsky] Connected to node [[#cloud-i-59503637-0][inet[/10.86.201.157:9310]]]
[Sikorsky] [1] connecting to [#cloud-i-59503637-0][inet[/10.86.201.157:9310]], disconnect[true]
[Sikorsky] [1] received response from [#cloud-i-59503637-0][inet[/10.86.201.157:9310]]:[
ping_response{target [
[Sikorsky][aVe_2csgSoe2Ux0qv-qrCQ][inet[/10.223.62.165:9310]]],
master [null],
cluster_name[elasticsearch-dev]},
ping_response{target [
[Maggott][Ad8I2KHqTQyQeCVtBnE9CQ][inet[/10.86.201.157:9310]]],
master [[Maggott][Ad8I2KHqTQyQeCVtBnE9CQ][inet[/10.86.201.157:9310]]],
cluster_name[elasticsearch-dev]}]
[Sikorsky] Disconnected from [[#cloud-i-59503637-0][inet[/10.86.201.157:9310]]]
[Sikorsky] received ping response with no matching id [1]

The EC2 discovery process correctly finds my other EC2 node (10.86.201.157)
using the EC2 DescribeInstances API. My primary node then:

  1. Connects to 10.86.201.157
  2. Receives a response from 10.86.201.157
  3. Promptly disconnects from 10.86.201.157

Is that message "received ping response with no matching id [1]" the reason
for the disconnect? If so, what does "id" refer to?

Thanks.

(James Cook) #3

My problem may be caused by both nodes booting at approximately the same
time. If I start one node, then wait a few minutes before starting the
other, they cluster.

Unfortunately, I do not have a lot of control over how Elastic Beanstalk
decides to start my servers.

Any ideas on how I can work around this?

-- jim

(Shay Banon) #4

The option you have in this case is to increase the ping timeout. It's OK not to get a response from another node while it's starting up, and there is a window in which nodes wait for others to start up.

You can set discovery.ec2.ping_timeout to a higher value (defaults to 3s).
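
As a rough illustration of the suggestion above, here is a minimal Java sketch that collects the discovery settings for an embedded node as plain key/value pairs. The class name `Ec2DiscoverySettings` and its `build` method are hypothetical and do not call any Elasticsearch API; the value `30s` is only an example. Note that the configuration posted at the top of the thread sets `discovery.zen.ping_timeout`, whereas the key named in this reply is `discovery.ec2.ping_timeout`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: gathers discovery settings as plain key/value pairs. */
final class Ec2DiscoverySettings {

    /**
     * Returns the settings in insertion order. The caller would pass them to
     * whatever settings mechanism the embedded node uses (not shown here).
     */
    static Map<String, String> build(String clusterName, String pingTimeout) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("cluster.name", clusterName);
        settings.put("discovery.type", "ec2");
        // Key named in the reply above. The originally posted configuration
        // used discovery.zen.ping_timeout instead.
        settings.put("discovery.ec2.ping_timeout", pingTimeout);
        return settings;
    }

    public static void main(String[] args) {
        // Print the settings in the same "key : value" layout used in this thread.
        for (Map.Entry<String, String> e : build("elasticsearch-dev", "30s").entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```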

(James Cook-2) #5

When I increase the ping timeout, I end up with both nodes reporting MasterNotFoundExceptions. If I start the nodes up sequentially, I have no problems. Perhaps there is a race condition at play?

This is 0.16.2.

(Shay Banon) #6

Can you set the discovery logging level to TRACE and gist the logs of both nodes?

(James Cook) #7

Hi Shay,

Sure, thanks for taking a look.

Node 1: https://gist.github.com/1023225
Node 2: https://gist.github.com/1023232

About two pages into each gist, you will see the ES configuration parameters
I am using. Both servers are deploying the exact same WAR file with an
Embedded ES server.

-- jim

(Shay Banon) #8

Hey,

It seems that the second node detected the first node on the first round of pings and correctly identified that it (the first node) should become master. It then went into a second round of pings to wait for the first node to become master, but failed to get a proper response from it on that round.

My guess is that the concurrency problems in making a connection to a node play a part in that (there are non-elasticsearch nodes in the pool as well). This was fixed in master, but you can work around it by specifying a longer ping timeout, which should work. I see that the logs indicate a 3 second ping timeout; can you try to increase it by setting discovery.ec2.ping_timeout to something like 30s?

(James Cook) #9

I had tried that earlier. Both nodes throw a MasterNotFoundException. Here
are the gists in that case:

Node1: https://gist.github.com/1023472
Node2: https://gist.github.com/1023471

(Shay Banon) #10

What operation do you perform when you create the node? It seems that it tries to perform an operation and, because the ping_timeout is now longer, cannot complete it because discovery has not finished yet.

(James Cook) #11

When I bootstrap ES (my code https://gist.github.com/977580), I perform
the following:

  1. server.start()
  2. Retrieve the current health status
  3. If the status is RED, I log this fact and wait for at least YELLOW
    status
  4. Check the current health status
  5. Log the "final" status
  6. If it is still RED, I throw an exception

Since I do not see my log statement (#3) indicating the RED status, I assume
that it is the call in #2 which is the operation to which you allude.
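
A minimal sketch of the sequence above, using hypothetical stand-in types (`Cluster`, `Status`, `bootstrap`) rather than the Elasticsearch client or the code in the gist. It only illustrates the inference in the previous paragraph: if the health call in step 2 throws, the RED log line in step 3 is never written.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins: none of these names come from Elasticsearch or the gist.
class BootstrapFlowSketch {
    enum Status { RED, YELLOW, GREEN }

    interface Cluster {
        Status health();       // steps 2 and 4; may throw while no master is known
        void waitForYellow();  // step 3; blocks until at least YELLOW or a timeout
    }

    /** Runs steps 2-6, appending each log message to the supplied list. */
    static void bootstrap(Cluster cluster, List<String> log) {
        Status status = cluster.health();                   // step 2
        if (status == Status.RED) {                         // step 3
            log.add("status is RED, waiting for YELLOW");
            cluster.waitForYellow();
        }
        status = cluster.health();                          // step 4
        log.add("final status: " + status);                 // step 5
        if (status == Status.RED) {                         // step 6
            throw new IllegalStateException("cluster is still RED");
        }
    }

    /** Test double whose health() always fails, as when no master has been elected. */
    static Cluster withoutMaster() {
        return new Cluster() {
            @Override public Status health() {
                throw new IllegalStateException("no master discovered yet");
            }
            @Override public void waitForYellow() { }
        };
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        try {
            bootstrap(withoutMaster(), log);
        } catch (IllegalStateException e) {
            System.out.println("bootstrap failed: " + e.getMessage());
        }
        System.out.println("log lines written: " + log.size());
    }
}
```

Because the exception escapes from the first `health()` call, neither the step 3 message nor the "final status" message is appended to the log.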

What best practice should I be following to ensure the cluster is up and
ready to receive calls?

Thanks,
jim

(Shay Banon) #12

The health API needs to find the master of the cluster in order to get the status. If the master is not available, you will get the exception. You should treat it, based on your logic, similar to a RED status.

(Shay Banon) #13

Btw, I worked on improving the unicast discovery in master. On top of the concurrency improvements when connecting, it's now much more lightweight.

(James Cook) #14

Thanks Shay.

So the best practice is to treat MasterNotFoundException as the same thing
as a RED status and continue looping (with some delay) until I get at least
a YELLOW status?
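
The loop described above might look roughly like this. It is only a sketch under assumptions: `HealthProbe`, `NoMasterException`, `scripted`, and `waitForAtLeastYellow` are hypothetical names standing in for the real health call and for MasterNotFoundException, and the attempt count and delay are arbitrary.

```java
// Hypothetical names throughout; this is not the Elasticsearch client API.
class StartupHealthWait {
    enum Status { RED, YELLOW, GREEN }

    /** Stand-in for the failure raised when no master has been found. */
    static class NoMasterException extends RuntimeException {
        NoMasterException(String message) { super(message); }
    }

    /** One health check; may throw NoMasterException while discovery is still running. */
    interface HealthProbe {
        Status check();
    }

    /** Treats "no master yet" the same as a RED status. */
    static Status checkOrRed(HealthProbe probe) {
        try {
            return probe.check();
        } catch (NoMasterException e) {
            return Status.RED;
        }
    }

    /** Polls until the status is at least YELLOW or attempts run out; returns the last status seen. */
    static Status waitForAtLeastYellow(HealthProbe probe, int maxAttempts, long delayMillis) {
        Status last = Status.RED;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            last = checkOrRed(probe);
            if (last != Status.RED) {
                return last;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                    return last;
                }
            }
        }
        return last;
    }

    /** Simulated probe: replays the given results in order; null (or running out) means "no master". */
    static HealthProbe scripted(Status... results) {
        int[] index = {0};
        return () -> {
            Status next = index[0] < results.length ? results[index[0]] : null;
            index[0]++;
            if (next == null) {
                throw new NoMasterException("master not discovered yet");
            }
            return next;
        };
    }

    public static void main(String[] args) {
        // Simulated startup: two checks without a master, one RED, then YELLOW.
        HealthProbe probe = scripted(null, null, Status.RED, Status.YELLOW);
        Status finalStatus = waitForAtLeastYellow(probe, 10, 50);
        System.out.println("final status: " + finalStatus);
        if (finalStatus == Status.RED) {
            throw new IllegalStateException("cluster did not become ready");
        }
    }
}
```

Bounding the number of attempts keeps a node that never finds a master from waiting forever; the caller can still throw once the final status comes back RED.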

-- jim

(Shay Banon) #15

Yeah. I was actually thinking about this way back when the health API was introduced, and considered treating it as a RED status, but I think it's a different level of "status".

(James Cook) #16

Perhaps that is why you added a "masterNodeTimeout" property? Can I use this
somehow when I check the cluster health? No docs that I could find.

-- jim

On Tue, Jun 14, 2011 at 6:06 PM, Shay Banon shay.banon@elasticsearch.comwrote:

Yea... . I was actually thinking about this way back when health API was
introduced, and considered treating it as RED status, but I think its a
different level of "status".

On Tuesday, June 14, 2011 at 9:47 PM, James Cook wrote:

Thanks Shay.

So the best practice is to treat MasterNotFoundException as the same thing
as a RED status and continue looping (with some delay) until I get at least
a YELLOW status?

-- jim

On Tue, Jun 14, 2011 at 2:21 PM, Shay Banon shay.banon@elasticsearch.comwrote:

The health API needs to find hte master of the cluster in order to get
it. If its not available, then you will get the exception. You should treat
it, based on your logic, similar to a RED status.

On Tuesday, June 14, 2011 at 9:01 PM, James Cook wrote:

When I bootstrap ES (my code https://gist.github.com/977580), I perform
the following:

  1. server.start()
  2. Retrieve the current health status
  3. If the status is RED, I log this fact and wait for at least YELLOW
    status
  4. Check the current health status
  5. Log the "final" status
  6. If it is still RED, I throw an exception

Since I do not see my log statement (#3) indicating the RED status, I
assume that it is the call in #2 which is the operation to which you allude.

What best practice should I be following to ensure the cluster is up and
ready to receive calls?

Thanks,
jim

On Tue, Jun 14, 2011 at 1:30 PM, Shay Banon shay.banon@elasticsearch.com wrote:

What operation do you perform when you create the node? It seems like the
node tries to run an operation right away, and because the ping_timeout is
now longer, discovery is not done yet and the operation cannot be performed.

On Monday, June 13, 2011 at 10:12 PM, James Cook wrote:

I had tried that earlier. Both nodes throw a MasterNotFoundException. Here
are the gists in that case:

Node1: https://gist.github.com/1023472
Node2: https://gist.github.com/1023471

On Mon, Jun 13, 2011 at 2:28 PM, Shay Banon shay.banon@elasticsearch.com wrote:

Hey,

It seems like the second node detected the first node on the first round of
pings and correctly identified that the first node should become master. It
then went into a second round of pings to wait for the first node to become
master, but failed to get a proper response from it on that round.

My guess is that the concurrency problems in making a connection to a
node play a part in that (there are non-elasticsearch nodes in the pool as
well). This was fixed in master, but you can work around it by specifying a
longer ping timeout. This should work. I see that the logs indicate a 3 second
ping timeout; can you try to increase it by setting
discovery.ec2.ping_timeout to something like 30s?

On Monday, June 13, 2011 at 8:23 PM, James Cook wrote:

Hi Shay,

Sure, thanks for taking a look.

Node 1: https://gist.github.com/1023225
Node 2: https://gist.github.com/1023232

About two pages into each gist, you will see the ES configuration
parameters I am using. Both servers are deploying the exact same WAR file
with an Embedded ES server.

-- jim
On Sun, Jun 12, 2011 at 8:07 PM, Shay Banon shay.banon@elasticsearch.com wrote:

Can you set the discovery logging level to TRACE and gist the logs of
both nodes?

On Monday, June 13, 2011 at 2:56 AM, James Cook wrote:

When I increase the ping timeout, I end up with both nodes reporting a
MasterNotFoundException. If I start the nodes up sequentially, I have no
problems. Perhaps there is a race condition at play?

This is 0.16.2.

Sent from my iPad

On Jun 12, 2011, at 3:21 AM, Shay Banon shay.banon@elasticsearch.com wrote:

The option that you have here is to increase the ping timeout in this
case. It's OK not to get a response from another node while it's starting up,
and there is a window where they wait for nodes to start up.

You can set discovery.ec2.ping_timeout to a higher value (defaults to 3s).
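
As a rough sketch, the suggestion above might look like the fragment below. The setting name is taken as written in this reply. Note that the configuration listing in the original post uses discovery.zen.ping_timeout instead, and which key a given version actually reads is not confirmed here. The file location is also an assumption, since the nodes in this thread are embedded in a WAR.

```yaml
# Assumed to go in elasticsearch.yml or the equivalent embedded node settings.
discovery.type: ec2
# 30s is the value suggested elsewhere in this thread; the stated default is 3s.
discovery.ec2.ping_timeout: 30s
```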

On Friday, June 10, 2011 at 9:03 PM, James Cook wrote:

My problem may be caused because both nodes boot up approximately at the
same time. If I start one node, then wait a few minutes before starting my
other node, they cluster.

Unfortunately, I do not have a lot of control over how Elastic Beanstalk
decides to start my servers.

Any ideas what I can do to work around?

-- jim

On Fri, Jun 10, 2011 at 1:02 PM, James Cook <jcook@tracermedia.com> wrote:

A little background, if needed.

[Sikorsky] Connected to node [[#cloud-i-59503637-0][inet[/10.86.201.157:9310]]]
[Sikorsky] [1] connecting to [#cloud-i-59503637-0][inet[/10.86.201.157:9310]], disconnect[true]
[Sikorsky] [1] received response from [#cloud-i-59503637-0][inet[/10.86.201.157:9310]]:[
ping_response{target [
[Sikorsky][aVe_2csgSoe2Ux0qv-qrCQ][inet[/10.223.62.165:9310]]],
master [null],
cluster_name[elasticsearch-dev]},
ping_response{target [
[Maggott][Ad8I2KHqTQyQeCVtBnE9CQ][inet[/10.86.201.157:9310]]],
master [[Maggott][Ad8I2KHqTQyQeCVtBnE9CQ][inet[/10.86.201.157:9310]]],
cluster_name[elasticsearch-dev]}]
[Sikorsky] Disconnected from [[#cloud-i-59503637-0][inet[/10.86.201.157:9310]]]
[Sikorsky] received ping response with no matching id [1]

The EC2 discovery process correctly finds my other EC2 node (10.86.201.157)
using the EC2 DescribeInstances API. My primary node then:

  1. Connects to 10.86.201.157
  2. Receives a response from 10.86.201.157
  3. Promptly disconnects from 10.86.201.157

Is that message "received ping response with no matching id [1]" the
reason for the disconnect? If so, what does "id" refer to?

Thanks.

On Fri, Jun 10, 2011 at 11:34 AM, James Cook <jcook@tracermedia.com> wrote:

I'm experiencing a new problem with EC2 clustering using 0.16.0.

I am starting two nodes on EC2, and during the discovery process I see this
log:
https://gist.github.com/1019038

My two nodes do not discover each other and fail to cluster. Ports are
open, and the configuration is identical between the nodes:

 cloud.aws.access_key : AKIAJWQRTNTMFXIMX3WA
 cloud.aws.secret_key : <HIDDEN>
 cluster.name : elasticsearch-dev
 discovery.type : ec2
 discovery.zen.ping_timeout : 30s
 gateway.s3.bucket : ppkc-es-gateway-dev
 gateway.type : s3
 http.enabled : true
 http.port : 9311
 index.mapping._id.indexed : true
 index.store.type : niofs
 name : Sikorsky
 network.host : 0.0.0.0
 node.data : true
 path.data : /var/local/es/data
 transport.tcp.port : 9310

(Shay Banon) #17

Yes, masterNodeTimeout can be set to control how long a request waits when the master has not yet been detected. It's mainly used internally and is not really "exposed" (in the REST API, for example), though it does have javadoc.


(James Cook) #18

I set the master timeout (Java API) but didn't see any appreciable
difference. I am still getting master not found exceptions at startup.

Stranger still, I wrapped all of my calls in try/catch blocks, so I think
the exception is being thrown by ES internals.


(system) #19