Crazy behavior, every 3rd command fails

Earlier today, our Elasticsearch cluster crashed (several data nodes went down). We brought it back up, and I'm now seeing some extremely strange behavior. The first symptom is that Kibana is timing out on every search; it just says "Searching..." until it times out and fails.

I'm also seeing that when I run curl -s $(hostname -i):9200/_cat/nodes several times in a row, every third request, CONSISTENTLY, fails. I originally thought this might map to a bad master node, but I did a rolling restart of the master nodes and they are all back online; in fact, I can successfully run a health check against every node in the cluster.
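To reproduce it, I just hit the endpoint in a loop, roughly like this (a sketch, not exact; the request count and --max-time value are arbitrary, and without a timeout the bad requests hang indefinitely):

```shell
# Rough repro loop: hit _cat/nodes repeatedly and flag which requests fail.
ES="$(hostname -i):9200"   # any node in the cluster
for i in $(seq 1 6); do
  if curl -sf --max-time 5 "$ES/_cat/nodes" > /dev/null; then
    echo "request $i: ok"
  else
    echo "request $i: FAILED"
  fi
done
```

Run against the cluster in its current state, every third iteration prints FAILED.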

I also tried restarting one data node, and I'm seeing strange behavior there as well. Initially, 100 shards went unassigned while it was offline (expected). When I brought it back online, shards started allocating again, and it recovered about two thirds of them, then froze, with the remaining third still unassigned. Running curl -s $(hostname -i):9200/_cat/recovery?v shows no activity whatsoever.
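For what it's worth, I'm watching the stuck shards with something like this (sketch; _cat/shards reports each shard's state, so filtering on UNASSIGNED shows the frozen ones):

```shell
# List only the shards still marked UNASSIGNED (sketch).
curl -s "$(hostname -i):9200/_cat/shards?v" | grep -i unassigned
```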

Additionally, other commands (recovery, cat shards, cat indices) are failing intermittently (cat nodes is the only one that sticks to the strict every-third-request pattern). Sometimes I get responses back from the cluster instantaneously; other times, the request simply never comes back, period.

Logs on the master show nothing critical, but occasionally I do see things like this:

ReceiveTimeoutTransportException[[HOSTNAME][IP:9300][cluster:monitor/nodes/stats[n]] request_id [53823] timed out after [15000ms]]
        at org.elasticsearch.transport.TransportService$
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.util.concurrent.ThreadPoolExecutor$

and this:

[2016-11-17 22:50:55,340][ERROR][marvel.agent.collector.cluster] [HOSTNAME] collector [{}] timed out when collecting data

Despite all this, health checks on the cluster are coming back green and happy.
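To be clear, by health checks I mean the standard endpoint, roughly:

```shell
# Cluster health keeps reporting green even while other requests hang.
curl -s "$(hostname -i):9200/_cluster/health?pretty"
```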

The cluster is pretty much unusable in this state. Does anyone have any ideas about what could be going on here? Baffling.

Update: I've now tried forcibly running a reroute API call on some of the shards that weren't moving. Most of them worked, but a few threw back this error:
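For reference, the calls were of this shape (a sketch of the ES 2.x _cluster/reroute "allocate" command; INDEX and HOSTNAME are the same placeholders as in the error below, and shard 8 is the one that was rejected):

```shell
# Sketch: force-allocate one unassigned shard copy onto a specific node.
curl -s -XPOST "$(hostname -i):9200/_cluster/reroute" -d '{
  "commands": [
    { "allocate": { "index": "INDEX", "shard": 8, "node": "HOSTNAME" } }
  ]
}'
```

The NO(...) entry in the decider output below says a copy of [INDEX][8] already exists on that node, which is why the allocate command is rejected even though every other decider says YES.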

[2016-11-17 23:28:54,275][DEBUG][action.admin.cluster.reroute] [INDEX] failed to perform [cluster_reroute (api)]
java.lang.IllegalArgumentException: [allocate] allocation of [INDEX][8] on node {HOSTNAME}{Arf3l7h6RsOL8ICMyZ15mw}{IP}{IP}{master=false} is not allowed, reason: [YES(below shard recovery limit of [5])][YES(node passes include/exclude/require filters)][YES(enough disk for shard on node, free: [3.9tb])][YES(allocation disabling is ignored)][YES(shard not primary or relocation disabled)][NO(shard cannot be allocated on same node [Arf3l7h6RsOL8ICMyZ15mw] it already exists on)][YES(total shard limit disabled: [index: -1, cluster: -1] <= 0)][YES(no allocation awareness enabled)][YES(allocation disabling is ignored)][YES(target node version [2.2.0] is same or newer than source node version [2.2.0])][YES(primary is already active)]
    at org.elasticsearch.cluster.routing.allocation.command.AllocateAllocationCommand.execute(
    at org.elasticsearch.cluster.routing.allocation.command.AllocationCommands.execute(
    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(
    at org.elasticsearch.action.admin.cluster.reroute.TransportClusterRerouteAction$1.execute(
    at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(
    at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(
    at org.elasticsearch.cluster.service.InternalClusterService$
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$
    at java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.util.concurrent.ThreadPoolExecutor$
