Kibana error "Request Timeout after 30000ms" during login

Hello,

Do you know what might be the reason for this error? It's very annoying.

{"type":"log","@timestamp":"2019-09-08T14:50:12Z","tags":["error","task_manager"],"pid":1,"message":"Failed to poll for work: Error: Request Timeout after 30000ms"}
{"type":"log","@timestamp":"2019-09-08T14:50:45Z","tags":["error","task_manager"],"pid":1,"message":"Failed to poll for work: Error: Request Timeout after 30000ms"}

{"type":"response","@timestamp":"2019-09-08T14:50:09Z","tags":[],"pid":1,"method":"post","statusCode":503,"req":{"url":"/api/security/v1/login","method":"post","headers":{"host":"localhost:5601","user-agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0","accept":"application/json, text/plain, */*","accept-language":"en-US,en;q=0.5","accept-encoding":"gzip, deflate","content-type":"application/json;charset=utf-8","kbn-version":"7.2.0","content-length":"44","connection":"keep-alive","referer":"http://localhost:5601/login?next=%2F"},"remoteAddress":"127.0.0.1","userAgent":"127.0.0.1","referer":"http://localhost:5601/login?next=%2F"},"res":{"statusCode":503,"responseTime":35757,"contentLength":9},"message":"POST /api/security/v1/login 503 35757ms - 9.0B"}

Elasticsearch master0 logs:

{"type": "server", "timestamp": "2019-09-08T14:28:15,092+0000", "level": "INFO", "component": "o.e.c.r.a.AllocationService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "cluster.uuid": "6lJ-oAsCQYi5Tndu0CpWDw", "node.id": "FsyZhvdITuqkUSYbgMSqRA", "message": "Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana_1][0]] ...])." }

Elasticsearch master1 logs:

{"type": "server", "timestamp": "2019-09-08T14:23:59,035+0000", "level": "DEBUG", "component": "o.e.a.s.TransportSearchAction", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "cluster.uuid": "6lJ-oAsCQYi5Tndu0CpWDw", "node.id": "E9Dj0FZmT22bVUZsWOwHLw",  "message": "All shards failed for phase: [query]"  }
{"type": "server", "timestamp": "2019-09-08T14:23:59,036+0000", "level": "WARN", "component": "r.suppressed", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "cluster.uuid": "6lJ-oAsCQYi5Tndu0CpWDw", "node.id": "E9Dj0FZmT22bVUZsWOwHLw",  "message": "path: /.kibana_task_manager/_search, params: {ignore_unavailable=true, index=.kibana_task_manager}" , 
"stacktrace": ["org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:296) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:139) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:259) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:105) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:50) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.InitialSearchPhase$2.onFailure(InitialSearchPhase.java:273) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) [elasticsearch-7.2.0.jar:7.2.0]",


curl -u elastic http://localhost:9200/_cluster/health?pretty
Enter host password for user 'elastic':
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 3,
  "active_shards" : 6,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

I haven't started any log collection tool like fluentd yet.

Update: restarting the pod with the "all shards failed" error solved the login issue.
Do you know what the root cause is?

Hi,
The requests that are timing out are the ones that reach out to Elasticsearch. Is there something between Kibana and Elasticsearch, like a proxy? Chances are pretty good that either

  1. Elasticsearch is somehow down. (Check its logs to see if an error is happening on your Elasticsearch cluster)
  2. A proxy is in-between you and your cluster and the proxy is misbehaving.

Proxies can be configured to hide themselves, but most add a header to the response if you look for it. Can you paste the curl command here with its response, adding -v so we can see the header output as well?

If it isn't a proxy, we will need to dig further.

Thanks
Rashmi

Thanks for answering @rashmi

We use Istio + Kubernetes; for inter-cluster communication we use the annotation sidecar.istio.io/inject: "false", and there is no Istio sidecar for the Elastic Stack apps.
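For reference, this is how we disable sidecar injection on the Elasticsearch pods; the annotation goes on the pod template (the surrounding field names below are a generic sketch, not our full manifest):

```yaml
# Pod template metadata disabling Istio sidecar injection.
# Everything except the annotation key/value is illustrative.
metadata:
  annotations:
    sidecar.istio.io/inject: "false"
```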

This is the output from a recently refreshed cluster (the indexes were removed):

curl -v -k https://myelastic.apps.example.com/_cluster/health?pretty
*   Trying 111.111.111.111...
* TCP_NODELAY set
* Connected to myelastic.apps.example.com (111.111.111.111) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=*.myelastic.apps.example.com
*  start date: Aug 27 17:15:26 2019 GMT
*  expire date: Nov 25 17:15:26 2019 GMT
*  issuer: C=US; O=Let's Encrypt; CN=Let's Encrypt Authority X3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Server auth using Basic with user 'elastic'
* Using Stream ID: 1 (easy handle 0x55867f65f580)
> GET /_cluster/health?pretty HTTP/2
> Host: myelastic.apps.example.com
> Authorization: Basic Zddddd3RpYzpjaGFuddddd==
> User-Agent: curl/7.58.0
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 200 
< content-type: application/json; charset=UTF-8
< content-length: 467
< date: Wed, 11 Sep 2019 10:31:51 GMT
< 
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
* Connection #0 to host myelastic.apps.example.com left intact

@jbudz /@tylersmalley - can you please help here?

Thanks
Rashmi

Update: I noticed that the timeouts start once the index size exceeds 500 MB.

It's difficult to give advice on the performance issues you are facing without further details on how your Elasticsearch infrastructure is architected and built. Could you provide some more info on the setup you are using, both on AWS and when building on-premise?

Also, it would be useful to see your Elasticsearch and Kibana configuration files (please don't forget to remove any secrets, if there are any!).

Also, the request timeout can be increased in Kibana's configuration. Just add elasticsearch.requestTimeout: 90000 to /etc/kibana/kibana.yml and check whether the error goes away.
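For reference, the setting looks like this in the config file (90000 ms is just an example value; the default is 30000, which matches the "30000ms" in the error message):

```yaml
# /etc/kibana/kibana.yml
# Time, in milliseconds, Kibana waits for responses from Elasticsearch.
elasticsearch.requestTimeout: 90000
```

Kibana needs a restart after changing this file.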

Thanks
Rashmi

I already increased the timeout to 90000, and I use IBM Cloud with a 2 IOPS/GB storage class (it's called "bronze").

@tylersmalley or @joshdover can you please help ?

Thanks
Rashmi

If requests to Elasticsearch are timing out, it's usually either network-related or the cluster is under extremely high load.

You could enable the slow log to try to gain insights on the Elasticsearch side. For the network, you could use Heartbeat to track the connection to Elasticsearch.
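As a sketch, the search slow log can be enabled per index through the index settings API. The index name below is taken from the failing search in the logs above, and the thresholds are illustrative; tune them to your expected latencies:

```shell
# Enable the search slow log on the index that was failing.
# Requires a reachable cluster; thresholds are examples only.
curl -u elastic -X PUT "http://localhost:9200/.kibana_task_manager/_settings" \
  -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "1s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'
```

Slow queries then show up in the Elasticsearch logs, which should narrow down whether the latency is in the cluster itself.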

Here are graphs of the Kibana response time and the Elasticsearch health status:

If your average Elasticsearch response time is 72 seconds, something is indeed provisioned incorrectly. It looks like you have a two-node cluster? How does each of those instances look? Is there memory, disk, or CPU contention?
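Assuming you have kubectl access to the namespace (the label selector below is illustrative and depends on how the chart labels the pods), one way to check both sides is the Kubernetes metrics API plus Elasticsearch's own node stats:

```shell
# CPU/memory per pod (requires metrics-server in the cluster);
# the label selector is an example and may differ in your deployment.
kubectl top pods -l app=elasticsearch-master

# OS load, JVM heap usage, and disk stats per Elasticsearch node.
curl -u elastic "http://localhost:9200/_nodes/stats/os,jvm,fs?pretty"
```

High JVM heap pressure or a nearly full disk in the node stats would point at provisioning rather than the network.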

Here are my Helm chart parameters, based on https://hub.helm.sh/charts/elastic/elasticsearch

Please let me know whether any parameters could be tuned further.

---
clusterName: "elasticsearch"
nodeGroup: "master"


replicas: 2
minimumMasterNodes: 2

esMajorVersion: ""

# Allows you to add any config files in /usr/share/elasticsearch/config/
# such as elasticsearch.yml and log4j2.properties
esConfig: {}

extraEnvs:

   - name: ES_JAVA_OPTS
     value: "-Xms2g -Xmx2g"
   - name: ELASTIC_USERNAME
     value: xxx
   - name: ELASTIC_PASSWORD
     value: xxx
   - name: xpack.security.enabled
     value: "true"

secretMounts: []


image: "docker.elastic.co/elasticsearch/elasticsearch"
imageTag: "7.2.0"
imagePullPolicy: "IfNotPresent"

podAnnotations: {}

labels: {}

resources:
  requests:
    cpu: "200m"
    memory: "4Gi"
  limits:
    cpu: "2000m"
    memory: "4Gi"

initResources: {}
  # limits:
  #   cpu: "25m"
  #   # memory: "128Mi"
  # requests:
  #   cpu: "25m"
  #   memory: "128Mi"

sidecarResources: {}
  # limits:
  #   cpu: "25m"
  #   # memory: "128Mi"
  # requests:
  #   cpu: "25m"
  #   memory: "128Mi"

networkHost: "0.0.0.0"

volumeClaimTemplate:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 10Gi

persistence:
  enabled: true
  annotations: {}

extraVolumes: []
  # - name: extras
  #   emptyDir: {}

extraVolumeMounts: []
  # - name: extras
  #   mountPath: /usr/share/extras
  #   readOnly: true

extraInitContainers: []
# - name: do-something
#   image: busybox
#   command: ['do', 'something']

# This is the PriorityClass settings as defined in
# https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass
priorityClassName: ""

# By default this will make sure two pods don't end up on the same node
# Changing this to a region would allow you to spread pods across regions
antiAffinityTopologyKey: "kubernetes.io/hostname"

# Hard means that by default pods will only be scheduled if there are enough nodes for them
# and that they will never end up on the same node. Setting this to soft will do this "best effort"
antiAffinity: "soft"

# This is the node affinity settings as defined in
# https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#node-affinity-beta-feature
nodeAffinity: {}

# The default is to deploy all pods serially. By setting this to parallel all pods are started at
# the same time when bootstrapping the cluster
podManagementPolicy: "Parallel"

protocol: http
httpPort: 9200
transportPort: 9300

service:
  type: ClusterIP
  nodePort:
  annotations: {}

updateStrategy: RollingUpdate

# This is the max unavailable setting for the pod disruption budget
# The default value of 1 will make sure that kubernetes won't allow more than 1
# of your pods to be unavailable during maintenance
maxUnavailable: 1

podSecurityContext:
  fsGroup: 1000

# The following value is deprecated,
# please use the above podSecurityContext.fsGroup instead
fsGroup: ""

securityContext:
  capabilities:
    drop:
    - ALL
  # readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

# How long to wait for elasticsearch to stop gracefully
terminationGracePeriod: 120

sysctlVmMaxMapCount: 262144

readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 3
  timeoutSeconds: 5

# https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html#request-params wait_for_status
clusterHealthCheckParams: "wait_for_status=green&timeout=1s"

## Use an alternate scheduler.
## ref: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
##
schedulerName: ""

imagePullSecrets: []
nodeSelector: {}
tolerations: []

# Enabling this will publicly expose your Elasticsearch instance.
# Only enable this if you have security enabled on your cluster
ingress:
  enabled: false
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  path: /
  hosts:
    - chart-example.local
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

nameOverride: ""
fullnameOverride: ""

# https://github.com/elastic/helm-charts/issues/63
masterTerminationFix: false

master:
  name: master
  exposeHttp: false
  replicas: 3
  heapSize: "512m"
  # additionalJavaOpts: "-XX:MaxRAM=512m"
  persistence:
    enabled: true
    accessMode: ReadWriteOnce
    name: data
    size: "4Gi"
    # storageClass: "ssd"
  readinessProbe:
    httpGet:
      path: /_cluster/health?local=true
      port: 9200
    initialDelaySeconds: 5
  # antiAffinity: "soft"
  antiAffinity: "soft"
  nodeAffinity: {}
  nodeSelector: {}
  tolerations: []
  # terminationGracePeriodSeconds: 60
  initResources: {}
    # limits:
    #   cpu: "25m"
    #   # memory: "128Mi"
    # requests:
    #   cpu: "25m"
    #   memory: "128Mi"
  resources:
    limits:
      cpu: "1"
      # memory: "1024Mi"
    requests:
      cpu: "25m"
      memory: "512Mi"

lifecycle: {}
  # preStop:
  #   exec:
  #     command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
  # postStart:
  #   exec:
  #     command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]

sysctlInitContainer:
  enabled: true

chownInitContainer:
  enabled: true
## Additional init containers
extraInitContainers: |

What about the resource usage of the container? What does the load look like? Limiting a master/data node to a single CPU unit seems low.

Thank you for checking @tylersmalley
I cannot provide historical data as I have limited access, but here are the CPU stats during fetching. I'm a bit concerned about the wa (iowait) value in top. There are also zombie processes, which isn't good either.
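A high wa figure usually points at storage latency, which would fit the 2 IOPS/GB storage class. Assuming the sysstat tools are available on the node or in the container, per-device I/O wait can be confirmed with:

```shell
# Extended device statistics: 3 samples at 1-second intervals.
# Watch the %util and await columns for the Elasticsearch data volume.
iostat -x 1 3
```

Sustained high await on the persistent volume would suggest the disk tier, not Elasticsearch itself, is the bottleneck.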