Kibana flapping between red and green

Hi List,

Running an ES 2.1 cluster with 4 nodes and Kibana 4.3. I am constantly seeing this after Kibana starts up:

  log   [09:39:20.395] [error][status][plugin:elasticsearch] Status changed from green to red - Request Timeout after 1500ms
  log   [09:39:24.367] [info][status][plugin:elasticsearch] Status changed from red to green - Kibana index ready
  log   [09:39:40.273] [error][status][plugin:elasticsearch] Status changed from green to red - Request Timeout after 1500ms
  log   [09:39:42.808] [info][status][plugin:elasticsearch] Status changed from red to green - Kibana index ready
  log   [09:40:15.028] [error][status][plugin:elasticsearch] Status changed from green to red - Request Timeout after 1500ms
  log   [09:40:17.557] [info][status][plugin:elasticsearch] Status changed from red to green - Kibana index ready
  log   [09:40:33.631] [error][status][plugin:elasticsearch] Status changed from green to red - Request Timeout after 1500ms
  log   [09:40:36.181] [info][status][plugin:elasticsearch] Status changed from red to green - Kibana index ready

Except for one index, which is red, every other index on my ES cluster is green, and it's serving graphs to a Grafana endpoint just fine.
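(For what it's worth, this is how I'm listing the non-green indices; localhost is just the node I happen to query:)

  curl -s 'http://localhost:9200/_cat/indices?v&health=red'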

Shield is not installed on this ES 2 cluster.

Any thoughts or pointers? For starters, I have increased some timeout values in kibana.yml.
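For reference, these are the kibana.yml timeouts I raised (the values below are just what I am trying at the moment, not a recommendation):

  # kibana.yml -- timeouts are in milliseconds
  elasticsearch.pingTimeout: 30000
  elasticsearch.requestTimeout: 60000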

Thanks

Anything in the ES logs?

Hi @warkolm,

Wanted to get back to you after updating my infra.

My ES infra now has 8 data nodes, 2 client nodes and 3 master nodes. Kibana 4.3 is pointing to one of the client nodes.
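Kibana's config just points at that client node, i.e. something like this in kibana.yml (hostname is a placeholder):

  elasticsearch.url: "http://es-client-1:9200"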

I still see the same messages, and the Kibana connection is still flapping even with the new infra changes.

I deleted the 'red' indices and now I have:

{
  "cluster_name" : "live",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 13,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 1734,
  "active_shards" : 5188,
  "relocating_shards" : 2,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 3,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 1366,
  "active_shards_percent_as_number" : 100.0
}
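(That is the output of the cluster health API, queried roughly like this; localhost stands in for whichever node I hit:)

  curl -s 'http://localhost:9200/_cluster/health?pretty'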

I have made sure nothing is populating ES except Marvel (which is what I wanted to use Kibana for).

When I shut down my LS2 instance, I did see some shard allocation tracebacks, but I will keep watching for other logs. I am able to reach Kibana's /app/marvel URL, but the data there does not seem right. For example, all the servers in the cluster show Shards as 0. That may be something else entirely, which I will investigate and post in the Marvel group, but I am holding off pending further investigation.
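To cross-check what Marvel shows (Shards as 0 for every server), I am also looking at the cat APIs directly, along these lines (host is a placeholder):

  curl -s 'http://localhost:9200/_cat/allocation?v'       # disk usage and shard count per data node
  curl -s 'http://localhost:9200/_cat/shards?v' | head    # sample of individual shard assignments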

Thanks.

Interesting observation: I moved Kibana to one of my client nodes and don't see the issue anymore. Perhaps it was a network issue?

Update: spoke too soon. The same issue persists, so I don't think it is a network issue between Kibana and the client nodes. It is something else.

State of my cluster:

{
  "cluster_name" : "sln-live",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 14,
  "number_of_data_nodes" : 9,
  "active_primary_shards" : 2075,
  "active_shards" : 6210,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Marvel was showing me cluster state, but now I see Marvel is not sending Kibana anything. All dashboards are empty.
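One thing I still need to confirm is whether the Marvel indices are being written to at all, e.g. (the exact index name pattern depends on the Marvel version, so I am using a wildcard):

  curl -s 'http://localhost:9200/_cat/indices/.marvel-*?v'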

Any help/thoughts? Any pointers on where to investigate?

Are all your nodes on the same network?

Yeah, they are on the same network. Different subnets though, connected through more than one L2 switch.

Thanks.

Then chances are your network is flaky.
It may be worth increasing the zen discovery timeouts a little to see if it helps.
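Something along these lines in elasticsearch.yml, as a starting point (the values are only a suggestion; the 2.x defaults are 3s / 30s / 3):

  # zen discovery and fault-detection timeouts
  discovery.zen.ping_timeout: 10s
  discovery.zen.fd.ping_timeout: 60s
  discovery.zen.fd.ping_retries: 5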

Hi mad_min,

We had the same issue. The following solution, setting the memory limit properly for the Kibana node process, seems to fix it:
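In short, cap the heap of the Node.js process that runs Kibana. One way to do that, assuming your bin/kibana start script passes NODE_OPTIONS through to node (the value is only an example; tune it for your host):

  # limit the V8 old-generation heap (in MB) for the Kibana node process
  export NODE_OPTIONS="--max-old-space-size=256"
  ./bin/kibana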

Best regards
Lukas

Thanks Lukas,

This has been running flawlessly thanks to your pointer.

We got a bit worried because we took apart parts of our network to figure out whether there was any issue with spanning tree or routing loops, and didn't find any.

Thanks