EC2 Cluster Failure

James_Cook · June 10, 2011, 3:34pm

I'm experiencing a new problem with EC2 clustering using 0.16.0.

I am starting two nodes on EC2, and during the discovery process I see this
log:

https://gist.github.com/oravecz/1019038

gistfile1.txt

15:01:57,475 DEBUG thread-9 arch.transport.netty:  70 - [Sikorsky] Connected to node [[#cloud-i-59503637-0][inet[/10.86.201.157:9310]]]
15:01:57,475 TRACE thread-9 ery.zen.ping.unicast:  62 - [Sikorsky] [1] connecting to [#cloud-i-59503637-0][inet[/10.86.201.157:9310]], disconnect[true]
15:01:57,477 TRACE ker #1-2 ery.zen.ping.unicast:  62 - [Sikorsky] [1] received response from [#cloud-i-59503637-0][inet[/10.86.201.157:9310]]: [ping_response{target [[Sikorsky][aVe_2csgSoe2Ux0qv-qrCQ][inet[/10.223.62.165:9310]]], master [null], cluster_name[elasticsearch-dev]}, ping_response{target [[Maggott][Ad8I2KHqTQyQeCVtBnE9CQ][inet[/10.86.201.157:9310]]], master [[Maggott][Ad8I2KHqTQyQeCVtBnE9CQ][inet[/10.86.201.157:9310]]], cluster_name[elasticsearch-dev]}]
15:01:57,477 DEBUG ker #1-2 arch.transport.netty:  70 - [Sikorsky] Disconnected from [[#cloud-i-59503637-0][inet[/10.86.201.157:9310]]]
15:01:57,477  WARN ker #1-2 ery.zen.ping.unicast:  86 - [Sikorsky] received ping response with no matching id [1]
15:02:18,474 TRACE thread-8 ery.zen.ping.unicast:  66 - [Sikorsky] [1] failed to connect to [#cloud-i-5b553335-0][inet[/10.254.91.0:9310]]
org.elasticsearch.transport.ConnectTransportException: [][inet[/10.254.91.0:9310]] connect_timeout[30s]
	at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:512)
	at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:473)
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)

(This log has been truncated; the full output is in the gist linked above.)

My two nodes do not discover each other and fail to cluster. Ports are open,
and the configuration is identical between the nodes:

 cloud.aws.access_key : AKIAJWQRTNTMFXIMX3WA
 cloud.aws.secret_key : <HIDDEN>
 cluster.name : elasticsearch-dev
 discovery.type : ec2
 discovery.zen.ping_timeout : 30s
 gateway.s3.bucket : ppkc-es-gateway-dev
 gateway.type : s3
 http.enabled : true
 http.port : 9311
 index.mapping._id.indexed : true
 index.store.type : niofs
 name : Sikorsky
 network.host : 0.0.0.0
 node.data : true
 path.data : /var/local/es/data
 transport.tcp.port : 9310

James_Cook · June 10, 2011, 5:02pm

A little background, if needed.

[Sikorsky] Connected to node [[#cloud-i-59503637-0][inet[/10.86.201.157:9310]]]
[Sikorsky] [1] connecting to [#cloud-i-59503637-0][inet[/10.86.201.157:9310]], disconnect[true]
[Sikorsky] [1] received response from [#cloud-i-59503637-0][inet[/10.86.201.157:9310]]:[
ping_response{target [
[Sikorsky][aVe_2csgSoe2Ux0qv-qrCQ][inet[/10.223.62.165:9310]]],
master [null],
cluster_name[elasticsearch-dev]},
ping_response{target [
[Maggott][Ad8I2KHqTQyQeCVtBnE9CQ][inet[/10.86.201.157:9310]]],
master [[Maggott][Ad8I2KHqTQyQeCVtBnE9CQ][inet[/10.86.201.157:9310]]],
cluster_name[elasticsearch-dev]}]
[Sikorsky] Disconnected from [[#cloud-i-59503637-0][inet[/10.86.201.157:9310]]]
[Sikorsky] received ping response with no matching id [1]

The EC2 discovery process correctly finds my other EC2 node (10.86.201.157)
using the EC2 DescribeInstances API. My primary node then:

Connects to 10.86.201.157
Receives a response from 10.86.201.157
Promptly disconnects from 10.86.201.157

Is that message "received ping response with no matching id [1]" the reason
for the disconnect? If so, what does "id" refer to?

Thanks.

James_Cook · June 10, 2011, 6:03pm

My problem may be caused by both nodes booting at approximately the same
time. If I start one node and wait a few minutes before starting the
other, they cluster.

Unfortunately, I do not have much control over how Elastic Beanstalk
decides to start my servers.

Any ideas on how I can work around this?

-- jim

kimchy · June 12, 2011, 7:21am

Your option here is to increase the ping timeout. It's OK not to get a response from another node while it's starting up, and there is a window during which nodes wait for other nodes to start up.

You can set discovery.ec2.ping_timeout to a higher value (it defaults to 3s).
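
For illustration only, here is how that setting might sit alongside the discovery settings already posted in this thread. The setting name is the one given above; the value 30s is an arbitrary example, chosen to match the value the original post used for discovery.zen.ping_timeout, and is not a recommendation.

```
 discovery.type : ec2
 discovery.ec2.ping_timeout : 30s
```

The original configuration sets discovery.zen.ping_timeout rather than discovery.ec2.ping_timeout, so it may be worth checking in the startup log which timeout the node actually reports.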

James_Cook_2 · June 12, 2011, 11:53pm

When I increase the ping timeout, I end up with both nodes reporting MasterNotFoundExceptions. If I start the nodes up sequentially, I have no problems. Perhaps there is a race condition at play?

This is 0.16.2.

kimchy · June 13, 2011, 12:07am

Can you set the discovery logging level to TRACE and gist the logs of both nodes?

James_Cook · June 13, 2011, 5:23pm

Hi Shay,

Sure, thanks for taking a look.

Node 1: EC2 Startup - Node 1 · GitHub
Node 2: 1023232’s gists · GitHub

About two pages into each gist, you will see the ES configuration parameters
I am using. Both servers deploy the exact same WAR file with an
embedded ES server.

-- jim

kimchy · June 13, 2011, 6:28pm

Hey,

It seems that the second node detected the first node on the first round of pings and correctly identified that it (the first node) should become master. It then went into a second round of pings to wait for the first node to become master, but failed to get a proper response from it on that round.

My guess is that the concurrency problems in making a connection to a node play a part in that (there are non-Elasticsearch nodes in the pool as well). This was fixed in master, but you can work around it by specifying a longer ping timeout. This should work. I see that the logs indicate a 3 second ping timeout; can you try increasing it by setting discovery.ec2.ping_timeout to something like 30s?

James_Cook · June 13, 2011, 7:12pm

I had tried that earlier. Both nodes throw a MasterNotFoundException. Here
are the gists in that case:

Node1: gist:1023472 · GitHub
Node2: gist:1023471 · GitHub

kimchy · June 14, 2011, 5:30pm

Which operation do you perform when you create the node? It seems that the node tries to perform an operation and, because the ping_timeout is now longer, cannot complete it because discovery has not finished yet.
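
If the first operation after startup really is racing against discovery, one possible way to tolerate that is to retry the operation for a while instead of failing on the first attempt. The following is a minimal sketch under that assumption, not code from Elasticsearch or from the poster's application: `StartupRetry` and `callWithRetry` are hypothetical names, and the `Callable` passed in stands for whatever request the application already issues right after starting the node.

```java
import java.util.concurrent.Callable;

/**
 * Hypothetical helper (not part of Elasticsearch): retries an operation that
 * may fail while discovery is still in progress.
 */
final class StartupRetry {

    private StartupRetry() {
    }

    /**
     * Calls the operation until it succeeds or maxAttempts is reached,
     * sleeping delayMillis between attempts. The last failure is rethrown.
     */
    static <T> T callWithRetry(Callable<T> operation, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                // Assumption: any failure here may be transient (e.g. no master yet).
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw lastFailure;
    }
}
```

In a real application the catch clause would probably be narrowed to the exception actually seen at startup, and the total retry time (attempts multiplied by delay) would need to exceed the configured ping timeout for the retry to be of any use.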

James_Cook · June 14, 2011, 6:01pm

When I bootstrap ES (my code: https://gist.github.com/977580), I perform
the following:

1. server.start()
2. Retrieve the current health status
3. If the status is RED, I log this fact and wait for at least YELLOW status
4. Check the current health status
5. Log the "final" status
6. If it is still RED, I throw an exception

Since I do not see my log statement (#3) indicating the RED status, I assume
that it is the call in #2 which is the operation to which you allude.

What best practice should I be following to ensure the cluster is up and
ready to receive calls?

Thanks,
jim

kimchy · June 14, 2011, 6:21pm

The health API needs to find the master of the cluster in order to report the health. If the master is not available, you will get the exception. Based on your logic, you should treat it like a RED status.
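
A minimal sketch of that advice in Java, not taken from the linked gist. The `HealthStatus` enum, the `MasterNotFoundException` class (named as it is called in this thread) and the `HealthProbe` helper are all local stand-ins invented for illustration, not Elasticsearch classes. The real health call would be passed in as the `healthCheck` supplier.

```java
import java.util.function.Supplier;

// Local stand-ins for illustration only; these are not the Elasticsearch classes.
enum HealthStatus { GREEN, YELLOW, RED }

class MasterNotFoundException extends RuntimeException {
    MasterNotFoundException(String message) {
        super(message);
    }
}

class HealthProbe {
    // Runs the supplied health check and reports RED when no master could be
    // found, so callers only ever have to reason about a status value.
    static HealthStatus statusOrRed(Supplier<HealthStatus> healthCheck) {
        try {
            return healthCheck.get();
        } catch (MasterNotFoundException e) {
            // Assumption: "no master yet" is handled the same way as RED.
            return HealthStatus.RED;
        }
    }
}
```

With a wrapper like this, the existing "if RED, log and wait" branch would also cover the case where discovery has not finished yet.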

kimchy · June 14, 2011, 6:25pm

Btw, I worked on improving the unicast discovery in master. On top of the concurrency improvements when connecting, it's now much more lightweight.

James_Cook · June 14, 2011, 6:47pm

Thanks Shay.

So the best practice is to treat MasterNotFoundException as the same thing
as a RED status and continue looping (with some delay) until I get at least
a YELLOW status?

-- jim
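
The loop described above could look roughly like the following Java sketch. It is only an illustration under stated assumptions: `ClusterStatus`, `ClusterReadiness`, `awaitAtLeastYellow`, `maxAttempts` and `delayMillis` are hypothetical names, and the real health call would be supplied as `healthCheck`. Any runtime exception from that call is counted as RED here, which is broader than catching only the master-not-found case.

```java
import java.util.function.Supplier;

// Local stand-in for illustration only; not the Elasticsearch status type.
enum ClusterStatus { GREEN, YELLOW, RED }

class ClusterReadiness {
    // Polls healthCheck until it reports YELLOW or GREEN, or until maxAttempts
    // checks have been made. Returns the last status seen (RED on give-up).
    static ClusterStatus awaitAtLeastYellow(Supplier<ClusterStatus> healthCheck,
                                            int maxAttempts,
                                            long delayMillis) {
        ClusterStatus status = ClusterStatus.RED;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                status = healthCheck.get();
            } catch (RuntimeException e) {
                // Assumption: a failed check (e.g. no master yet) counts as RED.
                status = ClusterStatus.RED;
            }
            if (status != ClusterStatus.RED) {
                return status;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop waiting.
                    Thread.currentThread().interrupt();
                    return status;
                }
            }
        }
        return status;
    }
}
```

Bounding the number of attempts keeps startup from hanging forever if the other node never appears; the caller can still throw when the result is RED.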

kimchy · June 14, 2011, 10:06pm

Yeah. I was actually thinking about this way back when the health API was introduced, and considered treating it as a RED status, but I think it's a different level of "status".

James_Cook · June 16, 2011, 12:42am

Perhaps that is why you added a "masterNodeTimeout" property? Can I use this
somehow when I check the cluster health? No docs that I could find.

-- jim

On Tue, Jun 14, 2011 at 6:06 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:

Yeah... I was actually thinking about this way back when the health API was
introduced, and considered treating it as a RED status, but I think it's a
different level of "status".

On Tuesday, June 14, 2011 at 9:47 PM, James Cook wrote:

Thanks Shay.

So the best practice is to treat MasterNotFoundException as the same thing
as a RED status and continue looping (with some delay) until I get at least
a YELLOW status?

-- jim

On Tue, Jun 14, 2011 at 2:21 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:

The health API needs to find the master of the cluster in order to get
it. If it's not available, then you will get the exception. You should treat
it, based on your logic, similar to a RED status.
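
The advice above (treat the missing-master failure like RED and keep polling with a delay until the status is at least YELLOW) could be sketched roughly as follows. This is only an illustration: `Health`, `NoMasterYetException`, and `StartupHealthWait` are hypothetical stand-ins, not Elasticsearch classes. The exception stands in for what the thread calls MasterNotFoundException, and the probe stands in for whatever call retrieves the cluster health.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for the three health colours discussed in the thread.
enum Health { RED, YELLOW, GREEN }

// Hypothetical stand-in for the exception raised while no master is known yet.
class NoMasterYetException extends RuntimeException {
    NoMasterYetException(String message) { super(message); }
}

final class StartupHealthWait {
    /**
     * Polls the probe until it reports YELLOW or GREEN, treating
     * NoMasterYetException exactly like a RED reading.
     * Returns the last status seen; RED means the attempts ran out.
     */
    static Health waitForAtLeastYellow(Callable<Health> probe, int maxAttempts, long delayMillis)
            throws Exception {
        Health last = Health.RED;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                last = probe.call();
            } catch (NoMasterYetException e) {
                last = Health.RED; // discovery not finished: handle like RED
            }
            if (last != Health.RED) {
                return last;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(delayMillis); // back off before the next poll
            }
        }
        return last;
    }

    public static void main(String[] args) throws Exception {
        // Fake probe: no master for two calls, then RED, then YELLOW.
        final int[] calls = {0};
        Callable<Health> probe = () -> {
            int n = ++calls[0];
            if (n <= 2) {
                throw new NoMasterYetException("no master yet");
            }
            return n == 3 ? Health.RED : Health.YELLOW;
        };
        Health result = waitForAtLeastYellow(probe, 10, 10L);
        System.out.println(result + " after " + calls[0] + " calls"); // prints: YELLOW after 4 calls
    }
}
```

If the method still returns RED once the attempts are used up, the caller can throw at that point, which matches the last step of the bootstrap sequence described in the thread.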

On Tuesday, June 14, 2011 at 9:01 PM, James Cook wrote:

When I bootstrap ES (my code https://gist.github.com/977580), I perform
the following:

1. server.start()
2. Retrieve the current health status
3. If the status is RED, I log this fact and wait for at least YELLOW status
4. Check the current health status
5. Log the "final" status
6. If it is still RED, I throw an exception

Since I do not see my log statement (#3) indicating the RED status, I
assume that it is the call in #2 which is the operation to which you allude.

What best practice should I be following to ensure the cluster is up and
ready to receive calls?

Thanks,
jim

On Tue, Jun 14, 2011 at 1:30 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:

What operation do you do when you create the node? It seems like it tries
to do an operation, and because the ping_timeout is longer, it will not be
able to perform it because the discovery is not done yet.

On Monday, June 13, 2011 at 10:12 PM, James Cook wrote:

I had tried that earlier. Both nodes throw a MasterNotFoundException. Here
are the gists in that case:

Node1: gist:1023472 · GitHub
Node2: gist:1023471 · GitHub

On Mon, Jun 13, 2011 at 2:28 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:

Hey,

It seems like the second node detected the first node on the first round of
pings and correctly identified that the first node should become master. It
then went into a second round of pings to wait for the first node to become
master, but failed to get a proper response from it on that second round.

My guess is that the concurrency problems in making a connection to a
node play a part in that (there are non-elasticsearch nodes in the pool as
well). This was fixed in master, but you can work around it by specifying a
longer ping timeout. This should work. I see that the logs indicate a 3 second
ping timeout; can you try to increase it by setting
discovery.ec2.ping_timeout to something like 30s?
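
As a rough illustration of the suggested workaround, the setting might look like this in elasticsearch.yml. This assumes the node reads its settings from that file; an embedded node may supply the same keys programmatically instead. The setting name and the 30s value are the ones given in this message.

```yaml
# Lengthen the EC2 discovery ping window (the thread gives 3s as the default).
discovery.type: ec2
discovery.ec2.ping_timeout: 30s
```

Note that the configuration posted at the start of the thread sets discovery.zen.ping_timeout rather than discovery.ec2.ping_timeout.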

On Monday, June 13, 2011 at 8:23 PM, James Cook wrote:

Hi Shay,

Sure, thanks for taking a look.

Node 1: EC2 Startup - Node 1 · GitHub
Node 2: 1023232’s gists · GitHub

About two pages into each gist, you will see the ES configuration
parameters I am using. Both servers are deploying the exact same WAR file
with an Embedded ES server.

-- jim

On Sun, Jun 12, 2011 at 8:07 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:

Can you set the discovery logging level to TRACE and gist the logs of
both nodes?
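
A minimal sketch of what that request might look like, assuming the node uses the stock logging.yml. An embedded node with its own log4j configuration would instead raise the level of the org.elasticsearch.discovery logger there.

```yaml
# logging.yml: raise only the discovery loggers to TRACE
logger:
  discovery: TRACE
```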

On Monday, June 13, 2011 at 2:56 AM, James Cook wrote:

When I increase the ping timeout, I end up with both nodes reporting
MasterNotFoundExceptions. If I start the nodes up sequentially, I have no
problems. Perhaps there is a race condition at play?

This is 0.16.2.

Sent from my iPad

On Jun 12, 2011, at 3:21 AM, Shay Banon <shay.banon@elasticsearch.com> wrote:

The option that you have here is to increase the ping timeout in this
case. It's OK not to get a response from another node while it's starting up,
and there is a window where they wait for nodes to start up.

You can set discovery.ec2.ping_timeout to a higher value (defaults to 3s).

On Friday, June 10, 2011 at 9:03 PM, James Cook wrote:

My problem may be caused because both nodes boot up approximately at the
same time. If I start one node, then wait a few minutes before starting my
other node, they cluster.

Unfortunately, I do not have a lot of control over how Elastic Beanstalk
decides to start my servers.

Any ideas what I can do to work around this?

-- jim

kimchy · June 16, 2011, 11:00am

Yes, the masterNodeTimeout can be set to have a timeout in case the master has not yet been detected. It's actually mainly used internally, and not actually "exposed" (in the REST API for example), though it does have javadoc.

James_Cook · June 16, 2011, 1:48pm

I set the master timeout (java api) but didn't see any appreciable
difference. I am still getting master not found exceptions at startup.

The stranger thing is I wrapped all of my calls in try/catch blocks. I think
the exception is being thrown by ES internals.