Very Strange Master Node Issue - Closed nodes not being removed

Hi All,

One of our clusters (v0.20.5) had an OOM issue, and we restarted it.
Unfortunately, our 2-node cluster has now become 3 nodes: the master just
adds a new node entry each time we restart the other node.

For instance, _cluster/state returns:

{
  "cluster_name" : "searchbox",
  "master_node" : "CMP9rjMJTJ-iukNTR4Lmvg",
  "blocks" : { },
  "nodes" : {
    "CMP9rjMJTJ-iukNTR4Lmvg" : {
      "name" : "Hornet",
      "transport_address" : "inet[/10.32.86.126:9300]",
      "attributes" : {
        "tag" : "level0",
        "master" : "true"
      }
    },
    "C9bJzhjlTLC9yUfkQOChhA" : {
      "name" : "searchbox1",
      "transport_address" : "inet[/10.147.179.29:9300]",
      "attributes" : {
        "tag" : "level0",
        "master" : "true"
      }
    },
    "nJZlgVUlRpmm4tDTVXneTg" : {
      "name" : "searchbox1",
      "transport_address" : "inet[/10.147.179.29:9300]",
      "attributes" : {
        "tag" : "level0",
        "master" : "true"
      }
    }
  },

and we get these log entries:

[2013-04-03 17:26:35,928][WARN ][cluster.service ] [Hornet] failed to reconnect to node [searchbox1][Wg3iZM88QsanRtaQFYbiYg][inet[/10.147.179.29:9300]]{tag=level0, master=true}
org.elasticsearch.transport.ConnectTransportException: [searchbox1][inet[/10.147.179.29:9300]] connect_timeout[30s]

But when I request _cluster/nodes, I get:

{
  "ok" : true,
  "cluster_name" : "searchbox",
  "nodes" : {
    "CMP9rjMJTJ-iukNTR4Lmvg" : {
      "name" : "Hornet",
      "transport_address" : "inet[/10.32.86.126:9300]",
      "hostname" : "ip-10-32-86-126",
      "version" : "0.20.5",
      "http_address" : "inet[/10.32.86.126:9200]",
      "attributes" : {
        "tag" : "level0",
        "master" : "true"
      }
    }
  }
}
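
To make the mismatch between the two responses explicit, here is a minimal Python sketch that compares them once they have been parsed into dicts. The function names `stale_node_ids` and `duplicate_addresses` are made up for illustration and are not part of any Elasticsearch client; the sample data is a trimmed copy of the output above, and fetching the JSON from the cluster is left out.

```python
from collections import defaultdict


def stale_node_ids(cluster_state, nodes_info):
    """Return node IDs listed in the cluster state but missing from the nodes info."""
    state_ids = set(cluster_state.get("nodes", {}))
    live_ids = set(nodes_info.get("nodes", {}))
    return sorted(state_ids - live_ids)


def duplicate_addresses(cluster_state):
    """Map each transport address claimed by more than one node ID to those IDs."""
    by_address = defaultdict(list)
    for node_id, node in cluster_state.get("nodes", {}).items():
        by_address[node["transport_address"]].append(node_id)
    return {addr: sorted(ids) for addr, ids in by_address.items() if len(ids) > 1}


# Trimmed copies of the _cluster/state and _cluster/nodes responses shown above.
cluster_state = {
    "master_node": "CMP9rjMJTJ-iukNTR4Lmvg",
    "nodes": {
        "CMP9rjMJTJ-iukNTR4Lmvg": {
            "name": "Hornet",
            "transport_address": "inet[/10.32.86.126:9300]",
        },
        "C9bJzhjlTLC9yUfkQOChhA": {
            "name": "searchbox1",
            "transport_address": "inet[/10.147.179.29:9300]",
        },
        "nJZlgVUlRpmm4tDTVXneTg": {
            "name": "searchbox1",
            "transport_address": "inet[/10.147.179.29:9300]",
        },
    },
}
nodes_info = {
    "nodes": {
        "CMP9rjMJTJ-iukNTR4Lmvg": {
            "name": "Hornet",
            "transport_address": "inet[/10.32.86.126:9300]",
        },
    },
}

print(stale_node_ids(cluster_state, nodes_info))
print(duplicate_addresses(cluster_state))
```

With the data above, both searchbox1 entries come back as stale, and 10.147.179.29:9300 is reported as claimed by two node IDs.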

Any idea what is happening?

Ferhat

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

One addition: we did not restart the master.

only one node hit the OOM or both?

simon

Hi Simon,
Only one node. We updated to v0.20.6 and restarted the master node, and then
the problem went away.
After 5 days, I'm not sure this is an Elasticsearch issue at all.
We have some customizations/extensions on top of the REST API, and
unfortunately I couldn't verify the root cause.
I will update you with better information (thread dumps, etc.) if it happens
again.

On Sunday, April 7, 2013 1:52:16 PM UTC+3, simonw wrote:

only one node hit the OOM or both?

simon

thanks!

On Monday, April 8, 2013 8:01:16 AM UTC+2, ferhatsb wrote:

Hi Simon,
Only one node. We updated to v0.20.6 and restarted the master node, and then
the problem went away.
After 5 days, I'm not sure this is an Elasticsearch issue at all.
We have some customizations/extensions on top of the REST API, and
unfortunately I couldn't verify the root cause.
I will update you with better information (thread dumps, etc.) if it happens
again.
