What happened to my ES cluster? Please help me


(이지홍) #1

My cluster consists of 3 instances, with IPs ending in 15 through 17.
This morning, instance 17 left the cluster.
In the elasticsearch-head plugin on instance 15, instance 17's status shows as "Unassigned", and instance 16 cannot be found at all.
What happened?
Please, somebody help me.

  1. Instance 17 log messages (below):

[2014-04-20 03:29:28,539][INFO ][discovery.zen ] [10.32.240.17] master_left [[10.32.240.16][YL2_5dVaTQ-_3Rvm1yKzoA][inet[/10.32.240.16:21001]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2014-04-20 03:29:28,540][INFO ][cluster.service ] [10.32.240.17] master {new [10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]], previous [10.32.240.16][YL2_5dVaTQ-_3Rvm1yKzoA][inet[/10.32.240.16:21001]]}, removed {[10.32.240.16][YL2_5dVaTQ-_3Rvm1yKzoA][inet[/10.32.240.16:21001]],}, reason: zen-disco-master_failed ([10.32.240.16][YL2_5dVaTQ-_3Rvm1yKzoA][inet[/10.32.240.16:21001]])
[2014-04-20 03:30:01,320][DEBUG][action.admin.cluster.node.stats] [10.32.240.17] failed to execute on node [a0qNnjLvQSauGEddNxKmNw]
org.elasticsearch.index.engine.EngineClosedException: [jp_listened_calcu_log][0] CurrentState[CLOSED]

    1. Instance 15 log messages:
      [2014-04-20 03:27:18,747][INFO ][discovery.zen ] [10.32.240.15] master_left [[10.32.240.16][YL2_5dVaTQ-_3Rvm1yKzoA][inet[/10.32.240.16:21001]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
      [2014-04-20 03:27:18,757][INFO ][cluster.service ] [10.32.240.15] master {new [10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]], previous [10.32.240.16][YL2_5dVaTQ-_3Rvm1yKzoA][inet[/10.32.240.16:21001]]}, removed {[10.32.240.16][YL2_5dVaTQ-_3Rvm1yKzoA][inet[/10.32.240.16:21001]],}, reason: zen-disco-master_failed ([10.32.240.16][YL2_5dVaTQ-_3Rvm1yKzoA][inet[/10.32.240.16:21001]])
      [2014-04-20 03:28:28,544][WARN ][transport ] [10.32.240.15] Received response for a request that has timed out, sent [68787ms] ago, timed out [38787ms] ago, action [discovery/zen/fd/masterPing], node [[10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]]], id [10310608]
      [2014-04-20 03:28:28,544][WARN ][transport ] [10.32.240.15] Received response for a request that has timed out, sent [38787ms] ago, timed out [8787ms] ago, action [discovery/zen/fd/masterPing], node [[10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]]], id [10310609]
      [2014-04-20 03:28:28,552][INFO ][discovery.zen ] [10.32.240.15] master_left [[10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]]], reason [no longer master]
      [2014-04-20 03:28:28,557][INFO ][cluster.service ] [10.32.240.15] master {new [10.32.240.15][dE_q8O-dT-SeUlTBuM-yiQ][inet[/10.32.240.15:21001]], previous [10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]]}, removed {[10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]],}, reason: zen-disco-master_failed ([10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]])
      [2014-04-20 03:29:28,546][WARN ][discovery.zen ] [10.32.240.15] received cluster state from [[10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]]] which is also master but with an older cluster_state, telling [[10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]]] to rejoin the cluster
      [2014-04-20 03:29:28,548][WARN ][discovery.zen ] [10.32.240.15] failed to send rejoin request to [[10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]]]
      org.elasticsearch.transport.SendRequestTransportException: [10.32.240.17][inet[/10.32.240.17:21001]][discovery/zen/rejoin]
      at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:202)
      at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:173)
      at org.elasticsearch.discovery.zen.ZenDiscovery$7.execute(ZenDiscovery.java:541)
      at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:298)
      at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:135)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:662)
      Caused by: org.elasticsearch.transport.NodeNotConnectedException: [10.32.240.17][inet[/10.32.240.17:21001]] Node not connected
      at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:834)
      at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:532)
      at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:189)
      ... 7 more
      [2014-04-20 03:29:28,603][WARN ][discovery.zen ] [10.32.240.15] received cluster state from [[10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]]] which is also master but with an older cluster_state, telling [[10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]]] to rejoin the cluster
      [2014-04-20 03:29:28,604][WARN ][discovery.zen ] [10.32.240.15] failed to send rejoin request to [[10.32.240.17][a0qNnjLvQSauGEddNxKmNw][inet[/10.32.240.17:21001]]]
      org.elasticsearch.transport.SendRequestTransportException: [10.32.240.17][inet[/10.32.240.17:21001]][discovery/zen/rejoin]
      at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:202)
      at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:173)
      at org.elasticsearch.discovery.zen.ZenDiscovery$7.execute(ZenDiscovery.java:541)
      at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:298)
      at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:135)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:662)
      Caused by: org.elasticsearch.transport.NodeNotConnectedException: [10.32.240.17][inet[/10.32.240.17:21001]] Node not connected
      at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:834)
      at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:532)
      at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:189)
      ... 7 more
  1. The Elasticsearch process on instance 17 is still alive:

/usr/bin/java -Xms2G -Xmx2G -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch -Des.path.home=/home/irteam/apps/elasticsearch-0.90.7 -cp :/home/irteam/apps/elasticsearch-0.90.7/lib/elasticsearch-0.90.7.jar:/home/irteam/apps/elasticsearch-0.90.7/lib/:/home/irteam/apps/elasticsearch-0.90.7/lib/sigar/ org.elasticsearch.bootstrap.ElasticSearch

  1. Configuration:
    cluster.name: music-es-beta
    node.name: 10.32.240.15
    http.port: 21200
    transport.tcp.port: 21001
    multicast.enabled: false
    index.number_of_shards: 3
    index.number_of_replicas: 1
    index.mapper.dynamic: false
    action.auto_create_index: false
    bootstrap.mlockall: true
    discovery.zen.ping.timeout: 10s
    index.cache.field.type: soft
    discovery.zen.ping.unicast.hosts: ["10.32.240.15", "10.32.240.16","10.32.240.17"]

  2. How should I set up the ES cluster for fail-over and fail-back?


(Binh Ly-2) #2

Could be something network related. From the logs, it looks like 16 dropped
out, and then 17 and 15 decided that 17 was the new master. If you have not
added more data since, you can restart 16 and see if it joins back into the
cluster. Regardless, you probably want to set
discovery.zen.minimum_master_nodes: 2 on all 3 of your nodes, to ensure that
a node that drops out will not form a cluster by itself and continue to
accept requests.
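
For reference, here is a minimal sketch of how that might look in each node's elasticsearch.yml. The quorum formula in the comment is the standard guideline for master-eligible nodes, not something specific to this thread; the value 2 follows from this cluster having 3 nodes:

```yaml
# On each of the 3 nodes, in elasticsearch.yml:
# quorum = floor(master_eligible_nodes / 2) + 1 = floor(3 / 2) + 1 = 2
# An isolated node then cannot elect itself master and keep accepting writes.
discovery.zen.minimum_master_nodes: 2
```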

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/209b2d44-04dd-4c18-bec7-8b2b14b046dc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Mark Walkom) #3

It looks like you lost connectivity between nodes; this may be due to GC.
Shut down all your nodes and then add this to your config:

  • discovery.zen.minimum_master_nodes: 2. Then restart your cluster one node
    at a time.

Are you using anything like ElasticHQ, kopf or marvel to monitor things?

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com



(이지홍) #4

Thank you very much.

I have new questions.

First, what is the default value of discovery.zen.minimum_master_nodes?

Second, about the isolated node 16:
a client sends a request to save a document to the "abc" index on node 16 (which has not rejoined the cluster yet),
and the save succeeds.
After node 16 is restarted and has rejoined the cluster,
is the "abc" index data OK?
As long as there are no duplicate doc IDs, will it be OK
after nodes 15, 16, and 17 are merged back together?


(Ivan Brusic) #5

There is no default value for minimum_master_nodes. If it is not set, the
value is simply not used to determine whether the cluster is whole.

If the documents do not have duplicate IDs, they should be merged when the
node rejoins the cluster. If you set minimum_master_nodes, the cluster
will not accept any document inserts while the cluster is red. The cluster
will be red if only one node is present (in order to prevent split-brain).
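
As a general guideline (standard Elasticsearch advice rather than anything specific to this cluster), minimum_master_nodes should be a quorum of the master-eligible nodes, so the setting grows with the cluster:

```yaml
# quorum = floor(master_eligible_nodes / 2) + 1
#   3 master-eligible nodes -> 2   (this cluster: survives losing 1 node)
#   4 master-eligible nodes -> 3
#   5 master-eligible nodes -> 3   (survives losing 2 nodes)
discovery.zen.minimum_master_nodes: 2
```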

Cheers,

Ivan



(system) #6