Inconsistent search cluster status and search results after long GC run

Hi,

Multiple times we have run into a problem where our search cluster ended up
in an inconsistent state. We have 3 nodes (all running 1.0.1), of which
nodes 2+3 hold the data (each holds all the shards, i.e. one replica per
shard). Sometimes a long GC run happens on one of the data nodes (here node
3), causing it to be dropped from the cluster because the GC pause exceeded
the fault-detection timeout (here the GC took 35.1s and our timeout is
currently 9s):

NODE 1
[2014-03-27 00:55:41,032][WARN ][discovery.zen ] [node1]
received cluster state from
[[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}]
which is also master but with an older cluster_state, telling
[[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}]
to rejoin the cluster
[2014-03-27 00:55:41,033][WARN ][discovery.zen ] [node1] failed
to send rejoin request to
[[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}]
org.elasticsearch.transport.SendRequestTransportException: [node2][inet[/10.216.32.81:9300]][discovery/zen/rejoin]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:202)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:173)
    at org.elasticsearch.discovery.zen.ZenDiscovery$7.execute(ZenDiscovery.java:556)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:308)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:134)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [node2][inet[/10.216.32.81:9300]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:859)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:540)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:189)
    ... 7 more
[2014-03-27 01:54:45,722][WARN ][discovery.zen ] [node1] received cluster state from
[[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}]
which is also master but with an older cluster_state, telling
[[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}]
to rejoin the cluster
[2014-03-27 01:54:45,723][WARN ][discovery.zen ] [node1] failed to send rejoin request to
[[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}]
(same SendRequestTransportException / NodeNotConnectedException stack trace as above)
[2014-03-27 07:19:02,889][WARN ][discovery.zen ] [node1] received cluster state from
[[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}]
which is also master but with an older cluster_state, telling
[[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}]
to rejoin the cluster
[2014-03-27 07:19:02,889][WARN ][discovery.zen ] [node1] failed to send rejoin request to
[[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}]
(same stack trace as above)

NODE 2
[2014-03-27 07:19:02,871][INFO ][cluster.service ] [node2] removed
{[node3][RRqWlTWnQ7ygvsOaJS0_mA][node3][inet[/10.235.38.84:9300]]{master=true},},
reason: zen-disco-node_failed([node3][RRqWlTWnQ7ygvsOaJS0_mA][node3][inet[/10.235.38.84:9300]]{master=true}),
reason failed to ping, tried [2] times, each with maximum [9s] timeout

NODE 3
[2014-03-27 07:19:20,055][WARN ][monitor.jvm ] [node3] [gc][old][539697][754]
duration [35.1s], collections [1]/[35.8s], total [35.1s]/[2.7m],
memory [4.9gb]->[4.2gb]/[7.9gb], all_pools {[young] [237.8mb]->[7.4mb]/[266.2mb]}
{[survivor] [25.5mb]->[0b]/[33.2mb]}{[old] [4.6gb]->[4.2gb]/[7.6gb]}
[2014-03-27 07:19:20,112][INFO ][discovery.zen ] [node3] master_left
[[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}],
reason [do not exists on master, act as master failure]
[2014-03-27 07:19:20,117][INFO ][cluster.service ] [node3] master {new
[node1][DxlcpaqOTmmpNSRoqt1sZg][node1.example][inet[/10.252.78.88:9300]]{data=false, master=true},
previous [node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true}},
removed {[node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true},},
reason: zen-disco-master_failed ([node2][A45sMYqtQsGrwY5exK0sEg][node2][inet[/10.216.32.81:9300]]{master=true})

After this scenario, the cluster doesn't recover properly. Worst of all,
the nodes end up with conflicting views: node 1 sees nodes 1+3, node 2 sees
nodes 1+2, and node 3 sees nodes 1+3. Since the cluster is set up to keep
operating with two nodes, both data nodes 2 and 3 accept writes and
searches, causing inconsistent results and forcing us to do a full cluster
restart and reindex all production data to be sure the cluster is
consistent again.

NODE 1 (GET /_nodes):
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "DxlcpaqOTmmpNSRoqt1sZg" : {
      "name" : "node1",
      ...
    },
    "RRqWlTWnQ7ygvsOaJS0_mA" : {
      "name" : "node3",
      ...
    }
  }
}

NODE 2 (GET /_nodes):
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "A45sMYqtQsGrwY5exK0sEg" : {
      "name" : "node2",
      ...
    },
    "DxlcpaqOTmmpNSRoqt1sZg" : {
      "name" : "node1",
      ...
    }
  }
}

NODE 3 (GET /_nodes):
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "DxlcpaqOTmmpNSRoqt1sZg" : {
      "name" : "node1",
      ...
    },
    "RRqWlTWnQ7ygvsOaJS0_mA" : {
      "name" : "node3",
      ...
    }
  }
}

Here are the configurations:

BASE CONFIG (for all nodes):
action:
  disable_delete_all_indices: true
discovery:
  zen:
    fd:
      ping_retries: 2
      ping_timeout: 9s
    minimum_master_nodes: 2
    ping:
      multicast:
        enabled: false
      unicast:
        hosts: ["node1.example", "node2.example", "node3.example"]
index:
  fielddata:
    cache: node
indices:
  fielddata:
    cache:
      size: 40%
  memory:
    index_buffer_size: 20%
threadpool:
  bulk:
    queue_size: 100
    type: fixed
transport:
  tcp:
    connect_timeout: 3s

NODE 1:
node:
  data: false
  master: true
  name: node1

NODE 2:
node:
  data: true
  master: true
  name: node2

NODE 3:
node:
  data: true
  master: true
  name: node3
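
For reference, the setting we rely on so that a partitioned data node stops
accepting writes is this part of the base config (repeated here for
clarity; as far as I know, Elasticsearch only reads it as
discovery.zen.minimum_master_nodes, i.e. directly under zen):

discovery:
  zen:
    minimum_master_nodes: 2   # with 3 master-eligible nodes, an isolated node should step down as master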

Questions:

  1. What can we do to minimize long GC runs, so the nodes don't become
    unresponsive and get disconnected in the first place? (FYI: our index is
    currently about 80 GB with over 2M docs per node, 60 shards, and an 8 GB
    heap. We run both searches and aggregations on it.)
  2. Obviously, having the cluster in a state like the above is unacceptable,
    so we want to make sure that even if a node drops out because of GC, the
    cluster can fully recover and only one of the two data nodes accepts
    writes and searches while a node is disconnected. Is there anything that
    needs to be changed in the Elasticsearch code to fix this issue?

Thanks,
Thomas


I would probably not master-enable any node that can potentially GC for a
couple of seconds. You want your master-eligible nodes to make decisions as
quickly as possible.
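
For example, a data node that is not master-eligible would look something
like this in elasticsearch.yml (a sketch following the per-node config
style you posted; you would still want enough dedicated master-eligible
nodes to keep a quorum of 2):

node:
  data: true
  master: false
  name: node2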

About your GC situation, I'd find out what the underlying cause is:

  1. Do you have bootstrap.mlockall set to true? (See the snippet after this
    list.)

  2. Does it usually trigger while running queries? Or is there a pattern to
    when it usually triggers?

  3. Is there anything else running on these nodes that could overload them
    and affect normal ES operations?
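
If not, enabling it is a one-line change (a minimal sketch; the
Elasticsearch process also needs permission to lock memory at the OS level,
e.g. ulimit -l unlimited):

bootstrap:
  mlockall: true   # lock the heap in RAM so it is never swapped out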


It seems you ran into trouble because you changed some of the default
settings, which made your situation worse.

Increase ping_timeout from 9s to 60s as a first band-aid - you have GC runs
of 35 seconds.

You should reduce the bulk thread pool queue from 100 to 50; this lowers
the memory pressure on top of the 20% of the heap you currently allow for
indexing. And give indexing more heap space: use 50% instead of 20%.
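
Concretely, against your posted base config those changes would look like
this (only the affected settings; a sketch using the same setting paths you
already have):

discovery:
  zen:
    fd:
      ping_timeout: 60s        # was 9s; survives a 35s GC pause
indices:
  memory:
    index_buffer_size: 50%     # was 20%; more heap for indexing
threadpool:
  bulk:
    queue_size: 50             # was 100; less queued bulk data held in memory
    type: fixed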

It would help more to diagnose whether the nodes exceed their capacity for
search and index operations. If so, think about adding nodes.

More fine-tuning after adding nodes could include the G1 GC with Java 8,
which is designed to minimize GC stalls. That would not solve node capacity
problems, though.

Jörg


Thanks Jörg,

I can increase the ping_timeout to 60s for now. However, shouldn't the goal
be to minimize the time GC runs? Is the node blocked while GC runs, so that
any requests to it are delayed? If so, it would be very bad to allow such
long GC runs.

Regarding the bulk thread pool: I specifically set the queue to a higher
value to avoid errors during bulk indexing (we sometimes got rejections
when the queue was full at the previous limit of 50; I was also going to
increase the "index" queue, since it sometimes produces errors too). I will
try keeping the limit and giving indexing more heap space instead, as you
suggested.

Regarding Java 8: we're currently running Java 7 and haven't tweaked any
GC-specific settings. Do you think it makes sense to switch to Java 8 in
production already and enable the G1 garbage collector?

Thanks again,
Thomas


Forgot to reply to your questions, Binh:

  1. No, I haven't set it. However, I wonder whether it has any significant
    effect, since swap space is barely used.
  2. It seems to happen when the cluster is under high load, but I haven't
    seen any specific pattern so far.
  3. No, there isn't. There's a very small Redis instance running on node1,
    but there's nothing else on the nodes with shards (where the GC problem
    happens).

If I were to disable master on the nodes that hold shards, I'd have to add
another dummy node with master: true so the cluster stays in a good state
when any one of the nodes is down.
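
That extra node would presumably be configured much like node1 is today,
just as a small dedicated master (a sketch; "tiebreaker" is only a
placeholder name):

node:
  data: false
  master: true
  name: tiebreaker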
