Kibana timeout during discover. Elasticsearch error: collector [node_stats] timed out when collecting data

Hello,

I'm having problems (errors below) using Discover in Kibana on my indices. Some, like dev.k8s-2019.09.09 (~900MB), load on the second attempt; others, like dev.k8s.apps-2019.09.09 (bigger, ~1.2GB), are more problematic and only load after several attempts.

I also noticed that the Elasticsearch cluster is bouncing between green and yellow after a while. Is that because replicas are enabled?

curl -u elastic  -k localhost:9200/_cluster/health?pretty
Enter host password for user 'elastic':
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 12,
  "active_shards" : 14,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 8,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 1,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 58.333333333333336
}
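
For what it's worth, the allocation explain API should show why the replica shards end up unassigned; with no request body it just explains an arbitrary unassigned shard (I haven't fully dug into its output yet):

curl -u elastic -XGET "http://localhost:9200/_cluster/allocation/explain?pretty"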

Please suggest what kind of performance tuning could help with this issue.
So far I'm thinking of filtering out unwanted logs to keep the index size down, but that doesn't feel like a long-term solution.
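
For the filtering idea, this is roughly what I have in mind, assuming Filebeat is what ships the k8s logs (the namespace and match string below are only placeholders, not my real config):

# filebeat.yml (sketch)
processors:
  - drop_event:
      when:
        or:
          - equals:
              kubernetes.namespace: "kube-system"   # drop noisy system namespaces
          - contains:
              message: "healthcheck"                # drop health-check spam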

curl -u elastic -XGET "http://localhost:9200/_nodes/status" Enter host password for user 'elastic': {"_nodes":{"total":0,"successful":0,"failed":0},"cluster_name":"elasticsearch","nodes":{}}

curl -u elastic -XGET "http://localhost:9200/_cat/indices"
green open dev.k8s.apps-2019.09.06   3Km10ONYSzytXsfhwrHhsg 1 1   92239   0 141.9mb  70.8mb
green open .kibana_task_manager            gSrVc5nbQey5epC7EQYtgA 1 1       2   0  58.4kb  45.5kb
green open .monitoring-es-7-2019.09.09     ue7p0il-S36Fwsg-1FwKJQ 1 1     193 120   3.3mb   1.6mb
green open .security-7                     q-guIlieQWqcWa1oKSENzQ 1 1      44 193 609.7kb 268.8kb
green open dev.k8s-2019.09.08              1AoYvbfqRpiKa5HZcT6sRg 1 1  736529   0   751mb 374.3mb
green open .kibana_1                       NuRBeYShSaiu_2U-h70XBg 1 1       7   0   138kb  66.1kb
green open dev.k8s.apps-2019.09.07   VVZCbdD4Tay20EjHWpvl7g 1 1   50865   0  84.3mb    42mb
green open .monitoring-kibana-7-2019.09.09 uy3sZ_CZR3GKOkTYJXnUTQ 1 1     217   0 429.5kb 222.2kb
green open dev.k8s.apps-2019.09.08   L4x7AkE2RAa0tymMPuueLA 1 1       0   0  20.4mb   9.4mb
green open dev.k8s.apps-2019.09.09   fgQTd72MSYqFhQJQDe4n0w 1 1 2016066   0   1.2gb 607.8mb
green open dev.k8s-2019.09.09              Z_Nr2kWSREeh7XcmoHSaFg 1 1  632409   0 925.1mb 472.1mb
green open dev.k8s.apps-2019.09.05   MaC_nAE1Rf2VDysP37MVMw 1 1   52765   0  77.6mb  39.7mb

Kibana error (after increasing the timeout from 30000ms to 90000ms):
Kibana Discover: Request Timeout after 90000ms
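
For reference, the relevant timeout setting in kibana.yml (this is where the 90000 above comes from):

# kibana.yml
elasticsearch.requestTimeout: 90000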

master-0 error:
{"type": "server", "timestamp": "2019-09-09T10:15:55,342+0000", "level": "ERROR", "component": "o.e.x.m.c.c.ClusterStatsCollector", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "node.id": "n1dOuodeSNaouzeqxlSbAw", "message": "collector [cluster_stats] timed out when collecting data" }

master-1 error:
{"type": "server", "timestamp": "2019-09-09T10:36:15,063+0000", "level": "WARN", "component": "o.e.c.InternalClusterInfoService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "node.id": "n1dOuodeSNaouzeqxlSbAw", "message": "Failed to update shard information for ClusterInfoUpdateJob within 15s timeout" } {"type": "server", "timestamp": "2019-09-09T10:37:06,171+0000", "level": "WARN", "component": "o.e.c.InternalClusterInfoService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "node.id": "n1dOuodeSNaouzeqxlSbAw", "message": "Failed to update shard information for ClusterInfoUpdateJob within 15s timeout" } {"type": "server", "timestamp": "2019-09-09T10:37:45,406+0000", "level": "ERROR", "component": "o.e.x.m.c.c.ClusterStatsCollector", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "node.id": "n1dOuodeSNaouzeqxlSbAw", "message": "collector [cluster_stats] timed out when collecting data" } {"type": "server", "timestamp": "2019-09-09T10:37:51,473+0000", "level": "WARN", "component": "o.e.c.InternalClusterInfoService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "node.id": "n1dOuodeSNaouzeqxlSbAw", "message": "Failed to update shard information for ClusterInfoUpdateJob within 15s timeout" } {"type": "server", "timestamp": "2019-09-09T10:37:55,411+0000", "level": "ERROR", "component": "o.e.x.m.c.i.IndexStatsCollector", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-1", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "node.id": "n1dOuodeSNaouzeqxlSbAw", "message": "collector [index-stats] timed out when collecting data" }

One more error after some time:
"Caused by: org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];", "at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:189) ~[elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:473) ~[elasticsearch-7.2.0.jar:7.2.0]", "... 10 more"] } {"type": "server", "timestamp": "2019-09-09T10:43:43,439+0000", "level": "WARN", "component": "r.suppressed", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "node.id": "bf9AAUgOQuufbPv_DqhPKg", "message": "path: /_monitoring/bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}" , "stacktrace": ["org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];", "at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:189) ~[elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:175) ~[elasticsearch-7.2.0.jar:7.2.0]",

Elasticsearch helm chart used:

---
clusterName: "elasticsearch"
nodeGroup: "master"

# The service that non master groups will try to connect to when joining the cluster
# This should be set to clusterName + "-" + nodeGroup for your master group
masterService: ""

# Elasticsearch roles that will be applied to this nodeGroup
# These will be set as environment variables. E.g. node.master=true
roles:
  master: "true"
  ingest: "true"
  data: "true"

replicas: 2
minimumMasterNodes: 2

esMajorVersion: ""

esConfig: {}

extraEnvs:
   - name: ES_JAVA_OPTS
     value: "-Xms2g -Xmx2g"
   - name: ELASTIC_USERNAME
     value: obfuscated
   - name: ELASTIC_PASSWORD
     value: obfuscated
   - name: xpack.security.enabled
     value: "true"

secretMounts: []

image: "docker.elastic.co/elasticsearch/elasticsearch"
imageTag: "7.2.0"
imagePullPolicy: "IfNotPresent"

podAnnotations: {}
  # iam.amazonaws.com/role: es-cluster

# additionals labels
labels: {}


resources:
  requests:
    cpu: "100m"
    memory: "4Gi"
  limits:
    cpu: "1000m"
    memory: "4Gi"

initResources: {}
  # limits:
  #   cpu: "25m"
  #   # memory: "128Mi"
  # requests:
  #   cpu: "25m"
  #   memory: "128Mi"

sidecarResources: {}
  # limits:
  #   cpu: "25m"
  #   # memory: "128Mi"
  # requests:
  #   cpu: "25m"
  #   memory: "128Mi"

networkHost: "0.0.0.0"

volumeClaimTemplate:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 10Gi

persistence:
  enabled: true
  annotations: {}

antiAffinityTopologyKey: "kubernetes.io/hostname"

# Hard means that by default pods will only be scheduled if there are enough nodes for them
# and that they will never end up on the same node. Setting this to soft will do this "best effort"
antiAffinity: "soft"

# This is the node affinity settings as defined in
# https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#node-affinity-beta-feature
nodeAffinity: {}

# The default is to deploy all pods serially. By setting this to parallel all pods are started at
# the same time when bootstrapping the cluster
podManagementPolicy: "Parallel"

protocol: http
httpPort: 9200
transportPort: 9300

service:
  type: ClusterIP
  nodePort:
  annotations: {}

updateStrategy: RollingUpdate

# This is the max unavailable setting for the pod disruption budget
# The default value of 1 will make sure that kubernetes won't allow more than 1
# of your pods to be unavailable during maintenance
maxUnavailable: 1

podSecurityContext:
  fsGroup: 1000

# The following value is deprecated,
# please use the above podSecurityContext.fsGroup instead
fsGroup: ""

securityContext:
  capabilities:
    drop:
    - ALL
  # readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

# How long to wait for elasticsearch to stop gracefully
terminationGracePeriod: 120

sysctlVmMaxMapCount: 262144

readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 3
  timeoutSeconds: 5

# https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html#request-params wait_for_status
clusterHealthCheckParams: "wait_for_status=green&timeout=1s"


master:
  name: master
  exposeHttp: false
  replicas: 3
  heapSize: "512m"
  # additionalJavaOpts: "-XX:MaxRAM=512m"
  persistence:
    enabled: true
    accessMode: ReadWriteOnce
    name: data
    size: "4Gi"
    # storageClass: "ssd"
  readinessProbe:
    httpGet:
      path: /_cluster/health?local=true
      port: 9200
    initialDelaySeconds: 5
  antiAffinity: "soft"

  resources:
    limits:
      cpu: "1"
      # memory: "1024Mi"
    requests:
      cpu: "25m"
      memory: "512Mi"


sysctlInitContainer:
  enabled: true

Is an index size of ~700MB already enough to cause these timeouts? I noticed the problem starts with indices larger than about 500MB. What should I tune? I'm going through https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html at the moment.
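
One thing from that page I'm considering is raising the refresh interval on the busy daily indices; roughly like this (the index name is just one example from the list above, and 30s is an arbitrary value):

curl -u elastic -XPUT "http://localhost:9200/dev.k8s.apps-2019.09.09/_settings" -H 'Content-Type: application/json' -d '{ "index": { "refresh_interval": "30s" } }'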
