How to resolve "Failed to poll for work. Work has timed out" error

Hello, dear Elastic community,

We recently upgraded our ELK stack from 7.7.0 to 7.15.2.
All my nodes are on the same version (7.15.2).
After the upgrade, our Kibana's status goes yellow repeatedly. When it goes yellow, I check the cluster and the nodes to see if anything is going wrong, but all cluster nodes are active and the cluster status is green.
When I checked the Kibana logs, I saw this error:

{"type":"log","@timestamp":"2021-12-12T06:16:11-05:00","tags":["error","plugins","taskManager"],"pid":1451,"message":"Failed to poll for work: Error: work has timed out"}
{"type":"log","@timestamp":"2021-12-12T06:16:56-05:00","tags":["info","status"],"pid":1451,"message":"Kibana is now available (was degraded)"}
{"type":"log","@timestamp":"2021-12-12T06:17:13-05:00","tags":["error","plugins","taskManager"],"pid":1451,"message":"Failed to poll for work: Error: work has timed out"}
{"type":"log","@timestamp":"2021-12-12T06:17:22-05:00","tags":["info","status"],"pid":1451,"message":"Kibana is now degraded (was available)"}

I ran this curl to check the Task Manager health, but I really don't know much about the Task Manager or how to proceed with this error.

curl -X GET "http://localhost:5601/api/task_manager/_health?pretty" -H 'kbn-xsrf: true'

The response:

{
    "id": "6461ac68-65ce-453f-9f8a-9652a2ea06f1",
    "timestamp": "2021-12-12T11:25:53.075Z",
    "status": "OK",
    "last_update": "2021-12-12T11:25:47.878Z",
    "stats": {
        "configuration": {
            "timestamp": "2021-12-12T10:40:21.887Z",
            "value": {
                "request_capacity": 1000,
                "max_poll_inactivity_cycles": 10,
                "monitored_aggregated_stats_refresh_rate": 60000,
                "monitored_stats_running_average_window": 50,
                "monitored_task_execution_thresholds": {
                    "default": {
                        "error_threshold": 90,
                        "warn_threshold": 80
                    },
                    "custom": {}
                },
                "poll_interval": 3000,
                "max_workers": 10
            },
            "status": "OK"
        }
    }
}

I saw that the poll_interval field is set to 3000, and I thought increasing it might solve my problem, but I don't know how to configure the Task Manager's settings. I checked the Task Manager documentation but couldn't find a way to configure it.

Long story short, could configuring the poll_interval field fix my problem? If so, how can I configure my Task Manager settings?

Thank you for reading; I'm looking forward to your reply.

We need help, guys; please let us know if you know anything about this issue. We tried increasing the poll_interval to 5000 (shown below), but it didn't work. Our Kibana still goes yellow a lot, and the only error in the logs is still the one above.
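
This is roughly what we changed in kibana.yml (assuming the standard xpack.task_manager.* settings; the values are just what we tried, not recommendations):

    # kibana.yml - Task Manager settings
    xpack.task_manager.poll_interval: 5000   # how often Task Manager polls for new work, in milliseconds (default 3000)
    xpack.task_manager.max_workers: 10       # maximum number of tasks run concurrently (default 10)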

This seems more like an Elasticsearch problem than a Kibana one. How many documents are you ingesting? How is the memory usage of Elasticsearch? Is it possible there are networking problems among the hosts?
If you aren't running monitoring yet (https://www.elastic.co/guide/en/elasticsearch/reference/current/monitoring-overview.html), setting it up would probably give you some visibility into what's happening inside your cluster (see the quick checks after the list below). I'm no expert on performance tuning, but everything you have written so far points to performance/capacity issues, so that looks like a good starting point to me. The usual suspects for this error are:

  1. Elasticsearch is somehow down. (Check its logs to see if an error is happening on your Elasticsearch cluster)
  2. A proxy sits in between you and your cluster, and the proxy is misbehaving.
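
If you just want a quick command-line check before setting up full monitoring, something like this shows the cluster status plus heap and CPU per node (assuming Elasticsearch is reachable on localhost:9200 without authentication):

    curl -X GET "http://localhost:9200/_cluster/health?pretty"
    curl -X GET "http://localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m"

If any node's heap.percent stays consistently high, that would point to the capacity issues mentioned above.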

Thanks
Rashmi

Thank you for your reply! @rashmi

Yes, we are currently having a heap problem with our cluster. The error message basically says:

{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [4156279272/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4156279272/3.8gb]

We have too many shards per node, and we assume this is the cause of the heap issue. We'll take action on it.
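
In case it helps anyone else, this is roughly how we checked the shard count and heap usage per node (again assuming Elasticsearch on localhost:9200 without authentication):

    curl -X GET "http://localhost:9200/_cat/allocation?v&h=node,shards,disk.indices,disk.percent"
    curl -X GET "http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max"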

Just for clarification, though: could this heap problem be the main reason for our Kibana problem?
