Hi Drew,
I tested again today. I stopped both nodes and started the first machine
(EU), which had only this config:
cluster.name: esdemohs
Nothing else, just the cluster name. My plan was to start up this node,
put some data in it, verify that everything was OK and then get another
machine (the US node) to join.
I went to the root URL and got a 200, and /_cluster/nodes also showed that
it was OK (one machine in its own cluster):
{
  "ok": true,
  "cluster_name": "esdemohs",
  "nodes": {
    "NEv9k_O4Q0SoUUKWXn0CXg": {
      "name": "Helio",
      "transport_address": "inet[/10.239.74.227:9300]",
      "hostname": "ip-10-239-74-227",
      "http_address": "inet[/10.239.74.227:9200]"
    }
  }
}
I then indexed some data and verified that it was all searchable (i.e. just
testing that the single instance worked).
I then went to the US machine and configured it like you said:
cluster.name: esdemohs
cloud.aws.access_key:
cloud.aws.secret_key:
cloud.aws.region: eu-west
discovery.type: ec2
discovery.ec2.host_type: public_ip
discovery.ec2.tag.cluster: esdemohs
discovery.ec2.ping_timeout: 5m
discovery.zen.ping_timeout: 5m
discovery.ec2.groups: sg-08895c7f # also tried putting this in an array and
# using the US machine's security group as well
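For reference, the array variant of the groups setting would look roughly like this in elasticsearch.yml. This is a sketch that assumes Elasticsearch accepts a YAML list for discovery.ec2.groups; sg-XXXXXXXX is a hypothetical placeholder for the US machine's security group ID, which isn't given in this thread.

```yaml
# elasticsearch.yml on the US node (excerpt)
# Array form of discovery.ec2.groups listing both security groups.
# "sg-XXXXXXXX" is a hypothetical placeholder for the US machine's group ID.
discovery.ec2.groups: ["sg-08895c7f", "sg-XXXXXXXX"]
```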
In the output, I get this cryptic line:
[2012-10-02 11:18:48,306][TRACE][discovery.ec2 ] [Hildegarde]
filtering out instance i-cade1081 based on groups [{GroupName: HS-ES-AWS,
GroupId: sg-08895c7f, }], not part of [sg-08895c7f]
It seems to be telling me that the instance isn't part of the security group
I explicitly told it to use, even though the same line shows that it is in
sg-08895c7f (I expected the opposite!)
If I remove the security groups setting, it finds the correct node (the EU
machine), receives a response and then disconnects:
[2012-10-02 11:11:06,554][TRACE][discovery.ec2 ] [Gauntlet]
building dynamic unicast discovery nodes...
[2012-10-02 11:11:06,555][TRACE][discovery.ec2 ] [Gauntlet]
filtering out instance i-9f8fcfd7 based tags {cluster=esdemohs}, not part
of [{Key: Name, Value: HS-Dublin, }]
[2012-10-02 11:11:06,555][TRACE][discovery.ec2 ] [Gauntlet]
filtering out instance i-fbdf8bb3 based tags {cluster=esdemohs}, not part
of [{Key: Name, Value: Dublin-2Ports, }]
[2012-10-02 11:11:06,671][TRACE][discovery.ec2 ] [Gauntlet]
adding i-cade1081, address ec2-46-137-44-113.eu-west-1.compute.amazonaws.com,
transport_address inet[
ec2-46-137-44-113.eu-west-1.compute.amazonaws.com/46.137.44.113:9300]
[2012-10-02 11:11:06,674][DEBUG][discovery.ec2 ] [Gauntlet]
using dynamic discovery nodes [[#cloud-i-cade1081-0][inet[
ec2-46-137-44-113.eu-west-1.compute.amazonaws.com/46.137.44.113:9300]]]
[2012-10-02 11:11:06,678][TRACE][discovery.zen.ping.unicast] [Gauntlet] [1]
connecting (light) to [#cloud-i-cade1081-0][inet[
ec2-46-137-44-113.eu-west-1.compute.amazonaws.com/46.137.44.113:9300]]
[2012-10-02 11:11:06,868][DEBUG][transport.netty ] [Gauntlet]
connected to node [[#cloud-i-cade1081-0][inet[
ec2-46-137-44-113.eu-west-1.compute.amazonaws.com/46.137.44.113:9300]]]
[2012-10-02 11:11:06,869][TRACE][discovery.zen.ping.unicast] [Gauntlet] [1]
connected to [#cloud-i-cade1081-0][inet[
ec2-46-137-44-113.eu-west-1.compute.amazonaws.com/46.137.44.113:9300]]
[2012-10-02 11:11:06,869][TRACE][discovery.zen.ping.unicast] [Gauntlet] [1]
sending to [#cloud-i-cade1081-0][inet[
ec2-46-137-44-113.eu-west-1.compute.amazonaws.com/46.137.44.113:9300]]
[2012-10-02 11:11:06,987][TRACE][discovery.zen.ping.unicast] [Gauntlet] [1]
received response from [#cloud-i-cade1081-0][inet[
ec2-46-137-44-113.eu-west-1.compute.amazonaws.com/46.137.44.113:9300]]:
[ping_response{target
[[Gauntlet][iuj9R7MOTrK_Rsq_Vc7V7w][inet[/10.214.27.146:9300]]], master
[null], cluster_name[esdemohs]}, ping_response{target
[[Helio][NEv9k_O4Q0SoUUKWXn0CXg][inet[/10.239.74.227:9300]]], master
[[Helio][NEv9k_O4Q0SoUUKWXn0CXg][inet[/10.239.74.227:9300]]],
cluster_name[esdemohs]}]
[2012-10-02 11:11:34,559][WARN ][discovery ] [Gauntlet]
waited for 30s and no initial state was set by the discovery
[2012-10-02 11:11:34,560][INFO ][discovery ] [Gauntlet]
esdemohs/iuj9R7MOTrK_Rsq_Vc7V7w
[2012-10-02 11:11:34,560][DEBUG][gateway ] [Gauntlet]
can't wait on start for (possibly) reading state from gateway, will do it
asynchronously
[2012-10-02 11:11:34,569][INFO ][http ] [Gauntlet]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/
10.214.27.146:9200]}
[2012-10-02 11:11:34,570][INFO ][node ] [Gauntlet]
{0.19.9}[6286]: started
I'm really unsure how to take this any further. It's almost as if the nodes
are found and then decide not to connect.
When this line appears on the US machine (I ran it again, hence the different
node name):
[2012-10-02 11:22:28,947][TRACE][discovery.zen.ping.unicast] [Firebird] [1]
received response from [#cloud-i-cade1081-0][inet[
ec2-46-137-44-113.eu-west-1.compute.amazonaws.com/46.137.44.113:9300]]:
[ping_response{target
[[Gauntlet][iuj9R7MOTrK_Rsq_Vc7V7w][inet[/10.214.27.146:9300]]], master
[null], cluster_name[esdemohs]}, ping_response{target
[[Gauntlet][iuj9R7MOTrK_Rsq_Vc7V7w][inet[/10.214.27.146:9300]]], master
[null], cluster_name[esdemohs]}, ping_response{target
[[Gauntlet][iuj9R7MOTrK_Rsq_Vc7V7w][inet[/10.214.27.146:9300]]], master
[null], cluster_name[esdemohs]}, ping_response{target
[[Firebird][7TNInTRSSo21NnsfBdEzFg][inet[/10.214.27.146:9300]]], master
[null], cluster_name[esdemohs]}, ping_response{target
[[Helio][NEv9k_O4Q0SoUUKWXn0CXg][inet[/10.239.74.227:9300]]], master
[[Helio][NEv9k_O4Q0SoUUKWXn0CXg][inet[/10.239.74.227:9300]]],
cluster_name[esdemohs]}]
The EU machine shows:
[2012-10-02 11:22:29,126][TRACE][transport.netty ] [Helio] channel
opened: [id: 0x502a3135, /50.16.164.109:41404 => /10.239.74.227:9300]
I don't think outbound ports are being restricted in AWS (only inbound
ones). The US machine does not elect its own master, as it seems to be
looking up the EU-WEST region and not finding anything. Any request to that
server returns status 503 (master not found).
My rootLogger is already on TRACE; I didn't notice any relevant additional
output.
Derry
On 1 October 2012 19:29, Derry O' Sullivan derryos@gmail.com wrote:
Hi Drew,
Thanks again for the response. This is my entire config - everything else
is the default (commented out).
I didn't include the "filtering out instance" lines (we have a lot of other
instances), but the output clearly shows:
a) The nodes making requests to EC2 to find the other nodes
b) The nodes finding the server in the other region (EU finding US and US
finding EU), e.g.:
[2012-10-01 10:27:41,249][TRACE][discovery.ec2 ] [Demiurge]
adding i-b7eb27ca, address ec2-107-22-26-95.compute-1.amazonaws.com,
transport_address inet[
ec2-107-22-26-95.compute-1.amazonaws.com/107.22.26.95:9300]
[2012-10-01 10:27:41,250][DEBUG][discovery.ec2 ] [Demiurge]
using dynamic discovery nodes [[#cloud-i-b7eb27ca-0][inet[
ec2-107-22-26-95.compute-1.amazonaws.com/107.22.26.95:9300]]]
[2012-10-01 10:29:02,400][TRACE][discovery.ec2 ] [Magnum I]
adding i-cade1081, address
ec2-176-34-212-181.eu-west-1.compute.amazonaws.com, transport_address
inet[
ec2-176-34-212-181.eu-west-1.compute.amazonaws.com/176.34.212.181:9300]
[2012-10-01 10:29:02,400][DEBUG][discovery.ec2 ] [Magnum I]
using dynamic discovery nodes [[#cloud-i-cade1081-0][inet[
ec2-176-34-212-181.eu-west-1.compute.amazonaws.com/176.34.212.181:9300]]]
c) Both nodes connecting specifically to their tag/cluster-matching node in
the other region, but disconnecting immediately:
[2012-10-01 10:29:02,498][TRACE][discovery.zen.ping.unicast] [Magnum I]
[13] received response from [#cloud-i-cade1081-0][inet[
ec2-176-34-212-181.eu-west-1.compute.amazonaws.com/176.34.212.181:9300]]:
[ping_response{target [[Magnum
I][Xn8IL89ESJyij3H6DYjKhw][inet[/10.214.221.68:9300]]], master [null],
cluster_name[esdemohs]}, ping_response{target [[Magnum
I][Xn8IL89ESJyij3H6DYjKhw][inet[/10.214.221.68:9300]]], master [null],
cluster_name[esdemohs]}, ping_response{target [[Magnum
I][Xn8IL89ESJyij3H6DYjKhw][inet[/10.214.221.68:9300]]], master [null],
cluster_name[esdemohs]}, ping_response{target
[[Demiurge][EIUweoZ6QWmq1cVCSaay3w][inet[/10.234.254.251:9300]]], master
[[Demiurge][EIUweoZ6QWmq1cVCSaay3w][inet[/10.234.254.251:9300]]],
cluster_name[esdemohs]}]
[2012-10-01 10:29:02,498][TRACE][discovery.ec2 ] [Magnum I]
full ping responses:
--> target
[[Demiurge][EIUweoZ6QWmq1cVCSaay3w][inet[/10.234.254.251:9300]]], master
[[Demiurge][EIUweoZ6QWmq1cVCSaay3w][inet[/10.234.254.251:9300]]]
[2012-10-01 10:29:02,499][DEBUG][discovery.ec2 ] [Magnum I]
filtered ping responses: (filter_client[true], filter_data[false])
--> target
[[Demiurge][EIUweoZ6QWmq1cVCSaay3w][inet[/10.234.254.251:9300]]], master
[[Demiurge][EIUweoZ6QWmq1cVCSaay3w][inet[/10.234.254.251:9300]]]
[2012-10-01 10:29:02,501][TRACE][discovery.zen.ping.unicast] [Magnum I]
[13] disconnecting from [#cloud-i-cade1081-0][inet[
ec2-176-34-212-181.eu-west-1.compute.amazonaws.com/176.34.212.181:9300]]
[2012-10-01 10:29:02,501][DEBUG][transport.netty ] [Magnum I]
disconnected from [[#cloud-i-cade1081-0][inet[
ec2-176-34-212-181.eu-west-1.compute.amazonaws.com/176.34.212.181:9300]]]
It says filter_client[true]; would that mean it thinks the other instance is
a client?
I don't have access to the machine right now, but I'll try to get the full
logs up on Pastebin or similar tomorrow.
Thanks,
D
On 1 October 2012 18:54, Drew Raines aaraines@gmail.com wrote:
Derry O' Sullivan wrote:
The config of my first server (eu):
cluster.name: esdemohs
cloud.aws.access_key:
cloud.aws.secret_key:
cloud.aws.region: us-east-1 # not sure if i have to tell
discovery.type: ec2
discovery.ec2.host_type: public_dns
discovery.ec2.ping.timeout: 5m
discovery.ec2.tag.cluster: esdemohs
config of my 2nd server (US):
cluster.name: esdemohs
cloud.aws.access_key:
cloud.aws.secret_key:
cloud.aws.region: eu-west-1
discovery.type: ec2
discovery.ec2.ping.timeout: 5m
discovery.ec2.host_type: public_dns
discovery.ec2.tag.cluster: esdemohs
I start up both servers; I get a status 200 for the EU server and a status
503 for the US server (searching the web shows that this is because it is
not able to join the cluster).
[...]
It seems that changing the config, cloud.aws.region or the tags does not
affect the outcome; the nodes just seem to connect and then disconnect again.
No obvious issues with your config, although it's interesting that the
us-east node didn't also elect itself master (maybe you have master
disabled on that one).
It feels more like a security group issue. Let's try turning on TRACE
logging for the us-east node. In logging.yml, uncomment the line that
looks like
#discovery: TRACE
and restart the node. That will give us some output about what it's
trying to access. Look for lines like:
using host_type...
filtering out instance...
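After uncommenting, the relevant part of logging.yml would look roughly like the excerpt below. This is a sketch assuming the stock 0.19.x layout, where per-module loggers are nested under the "logger" key; because loggers are hierarchical, setting "discovery" should also cover child loggers such as discovery.ec2 and discovery.zen.ping.unicast, which produced the lines quoted earlier in this thread.

```yaml
# config/logging.yml (excerpt; other entries left as shipped)
logger:
  # discovery (was "#discovery: TRACE"); also applies to child loggers
  # such as discovery.ec2 and discovery.zen.ping.unicast
  discovery: TRACE
```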
-Drew