Crazy behavior, every 3rd command fails

Maxwell_Flanders · November 17, 2016, 11:14pm

Earlier today, our Elasticsearch cluster crashed (several data nodes went down). We brought it back, and I'm now seeing some extremely strange behavior. The first symptom is that Kibana times out on every search: it just says "Searching..." until it times out and fails.

I'm also seeing that when I run curl -s $(hostname -i):9200/_cat/nodes multiple times in a row, every third request, CONSISTENTLY, fails. I originally thought this might map to a bad master node, but I did a rolling restart of the master nodes and they are all back online. In fact, I can successfully run a health check against every node in the cluster.
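
For anyone who wants to reproduce the pattern, here is a minimal sketch of the same loop in Python. It records which attempts fail, so a strict every-third pattern is easy to tell apart from random failures. The function names, the default host and port, the 10-second timeout, and the attempt count are all placeholder assumptions, not values from the cluster.

```python
import urllib.request


def probe(fetch, attempts=9):
    """Call fetch() `attempts` times and return the 1-based numbers of the calls that failed."""
    failed = []
    for attempt in range(1, attempts + 1):
        try:
            fetch()
        except Exception:  # timeouts, refused connections, HTTP errors
            failed.append(attempt)
    return failed


def cat_nodes(host="127.0.0.1", port=9200, timeout=10):
    """GET /_cat/nodes from one node; raises on error or after `timeout` seconds."""
    url = "http://{}:{}/_cat/nodes".format(host, port)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example (needs a reachable node, so it is left commented out):
# print(probe(cat_nodes, attempts=9))
```

If the failing attempts are exactly every third one, the failures are periodic rather than random, which is what the curl loop showed here.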

Also, I tried restarting one data node and I'm seeing some strange behavior here as well. Initially, 100 shards went unassigned while it was offline (expected). When I brought it back online, shards started to allocate again; it recovered two-thirds of them and then froze, with the remaining third still unassigned. Running curl -s $(hostname -i):9200/_cat/recovery?v shows no activity whatsoever.
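
One way to track how many shards are stuck is to tally shard states from the cat shards output. The sketch below assumes the output was requested with _cat/shards?h=index,shard,prirep,state, so the state is the fourth column. The function name and the sample rows are made up for illustration and are not output from this cluster.

```python
from collections import Counter


def shard_state_counts(cat_shards_text):
    """Count shards per state from `_cat/shards?h=index,shard,prirep,state` output."""
    counts = Counter()
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:  # skip blank or truncated lines
            counts[fields[3]] += 1
    return counts


# Made-up sample rows for illustration, not output from the cluster above.
sample = """\
logs-a 0 p STARTED
logs-a 0 r UNASSIGNED
logs-b 1 p STARTED
"""
print(shard_state_counts(sample))
```

Running this before and after a restart would show whether the UNASSIGNED count is actually shrinking or has stalled.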

Additionally, other commands (recovery, cat shards, cat indices) fail intermittently (cat nodes is the only one that adheres to the strict one-in-three rule). Sometimes I get responses back from the cluster instantaneously, and other times the request simply never comes back.

Logs on the master show nothing critical, but occasionally, I do see things like this:

ReceiveTimeoutTransportException[[HOSTNAME][IP:9300][cluster:monitor/nodes/stats[n]] request_id [53823] timed out after [15000ms]]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:645)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

and this:

[2016-11-17 22:50:55,340][ERROR][marvel.agent.collector.cluster] [HOSTNAME] collector [{}] timed out when collecting data

Despite all this, health checks on the cluster are coming back green and happy.

The cluster is pretty much unusable in this state. Does anyone have any idea what could be going on here? Baffling.

Update: I have now tried forcing a reroute API call on some of the shards that weren't moving. Most of them worked, but a few threw back this error:

[2016-11-17 23:28:54,275][DEBUG][action.admin.cluster.reroute] [INDEX] failed to perform [cluster_reroute (api)]
java.lang.IllegalArgumentException: [allocate] allocation of [INDEX][8] on node {HOSTNAME}{Arf3l7h6RsOL8ICMyZ15mw}{IP}{IP}{master=false} is not allowed, reason: [YES(below shard recovery limit of [5])][YES(node passes include/exclude/require filters)][YES(enough disk for shard on node, free: [3.9tb])][YES(allocation disabling is ignored)][YES(shard not primary or relocation disabled)][NO(shard cannot be allocated on same node [Arf3l7h6RsOL8ICMyZ15mw] it already exists on)][YES(total shard limit disabled: [index: -1, cluster: -1] <= 0)][YES(no allocation awareness enabled)][YES(allocation disabling is ignored)][YES(target node version [2.2.0] is same or newer than source node version [2.2.0])][YES(primary is already active)]
    at org.elasticsearch.cluster.routing.allocation.command.AllocateAllocationCommand.execute(AllocateAllocationCommand.java:220)
    at org.elasticsearch.cluster.routing.allocation.command.AllocationCommands.execute(AllocationCommands.java:116)
    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:185)
    at org.elasticsearch.action.admin.cluster.reroute.TransportClusterRerouteAction$1.execute(TransportClusterRerouteAction.java:94)
    at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
    at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:458)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:762)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
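
The reason string in that exception is a flat list of allocation decisions, which is hard to scan by eye. A minimal sketch for pulling out only the non-YES decisions follows. The regex and the function name are my own illustration, and it assumes each decision is printed as [VERDICT(reason)] as in the log above.

```python
import re

# One decision looks like [YES(reason)], [NO(reason)] or [THROTTLE(reason)].
DECISION = re.compile(r"\[(YES|NO|THROTTLE)\((.*?)\)\]")


def blocking_deciders(message):
    """Return the reasons of every decision that did not answer YES."""
    return [reason for verdict, reason in DECISION.findall(message) if verdict != "YES"]


# Shortened copy of the reason string from the error above.
message = (
    "[YES(below shard recovery limit of [5])]"
    "[YES(enough disk for shard on node, free: [3.9tb])]"
    "[NO(shard cannot be allocated on same node [Arf3l7h6RsOL8ICMyZ15mw] it already exists on)]"
    "[YES(primary is already active)]"
)
for reason in blocking_deciders(message):
    print(reason)
```

In the error above, the only NO is the same-node check: a copy of [INDEX][8] already exists on the target node, so the reroute refuses to place another copy there.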

system · December 15, 2016, 11:14pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
