Kibana timeout during discover. Elasticsearch error: collector [node_stats] timed out when collecting data


I have problems (errors below) discovering indices in Kibana. Some indices load on the second attempt, like dev.k8s-2019.09.09 (900 MB). Others are more problematic, like dev.k8s.apps-2019.09.09 (larger, 1.2 GB), and load only after several attempts.

Also, I notice that the Elasticsearch cluster bounces between green and yellow after some time. Is that because replicas are enabled?

curl -u elastic  -k localhost:9200/_cluster/health?pretty
Enter host password for user 'elastic':
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 12,
  "active_shards" : 14,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 8,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 1,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 58.333333333333336
}
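
As a sanity check on the health output, active_shards_percent_as_number appears to be the active shards divided by all shard copies (active + initializing + unassigned). The 24 copies also match the 12 indices in the _cat/indices listing, each with 1 primary and 1 replica. A minimal Python sketch using only the numbers shown above; the function name is made up for illustration:

```python
# Values copied from the _cluster/health output in the question.
health = {
    "active_shards": 14,
    "initializing_shards": 2,
    "unassigned_shards": 8,
}


def active_shard_percent(h):
    """Share of all shard copies (primaries and replicas) that are active."""
    total = h["active_shards"] + h["initializing_shards"] + h["unassigned_shards"]
    return 100.0 * h["active_shards"] / total


total_shards = sum(health.values())
print("shard copies:", total_shards)
print("active percent:", round(active_shard_percent(health), 2))
```

So 10 of the 24 shard copies were not active when the health check ran, even though all 12 primaries were.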

Please suggest what kind of performance tuning could help with this issue.
So far I'm thinking of filtering out unwanted logs to reduce the index size, but that's not really a long-term solution.

curl -u elastic -XGET "http://localhost:9200/_nodes/status"
Enter host password for user 'elastic':
{"_nodes":{"total":0,"successful":0,"failed":0},"cluster_name":"elasticsearch","nodes":{}}

curl -u elastic -XGET "http://localhost:9200/_cat/indices"
green open dev.k8s.apps-2019.09.06         3Km10ONYSzytXsfhwrHhsg 1 1   92239   0 141.9mb  70.8mb
green open .kibana_task_manager            gSrVc5nbQey5epC7EQYtgA 1 1       2   0  58.4kb  45.5kb
green open .monitoring-es-7-2019.09.09     ue7p0il-S36Fwsg-1FwKJQ 1 1     193 120   3.3mb   1.6mb
green open .security-7                     q-guIlieQWqcWa1oKSENzQ 1 1      44 193 609.7kb 268.8kb
green open dev.k8s-2019.09.08              1AoYvbfqRpiKa5HZcT6sRg 1 1  736529   0   751mb 374.3mb
green open .kibana_1                       NuRBeYShSaiu_2U-h70XBg 1 1       7   0   138kb  66.1kb
green open dev.k8s.apps-2019.09.07         VVZCbdD4Tay20EjHWpvl7g 1 1   50865   0  84.3mb    42mb
green open .monitoring-kibana-7-2019.09.09 uy3sZ_CZR3GKOkTYJXnUTQ 1 1     217   0 429.5kb 222.2kb
green open dev.k8s.apps-2019.09.08         L4x7AkE2RAa0tymMPuueLA 1 1       0   0  20.4mb   9.4mb
green open dev.k8s.apps-2019.09.09         fgQTd72MSYqFhQJQDe4n0w 1 1 2016066   0   1.2gb 607.8mb
green open dev.k8s-2019.09.09              Z_Nr2kWSREeh7XcmoHSaFg 1 1  632409   0 925.1mb 472.1mb
green open dev.k8s.apps-2019.09.05         MaC_nAE1Rf2VDysP37MVMw 1 1   52765   0  77.6mb  39.7mb
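
To see at a glance which indices are largest, the listing can be ranked by total store size. This is a rough Python sketch, assuming the default _cat/indices column order shown above and binary size units; parse_size and parse_cat_indices are hypothetical helper names, and the sample holds three rows copied from the listing:

```python
import re

# Multipliers for the size suffixes that appear in the listing (binary units assumed).
UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def parse_size(text):
    """Convert a _cat size such as '141.9mb' into bytes."""
    match = re.fullmatch(r"([\d.]+)([a-z]+)", text)
    number, unit = match.groups()
    return float(number) * UNITS[unit]


def parse_cat_indices(output):
    """Return (index name, total store bytes) pairs, largest first."""
    rows = []
    for line in output.strip().splitlines():
        cols = line.split()
        # Columns: health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
        rows.append((cols[2], parse_size(cols[8])))
    return sorted(rows, key=lambda row: row[1], reverse=True)


sample = """
green open dev.k8s-2019.09.08 1AoYvbfqRpiKa5HZcT6sRg 1 1 736529 0 751mb 374.3mb
green open dev.k8s.apps-2019.09.09 fgQTd72MSYqFhQJQDe4n0w 1 1 2016066 0 1.2gb 607.8mb
green open dev.k8s-2019.09.09 Z_Nr2kWSREeh7XcmoHSaFg 1 1 632409 0 925.1mb 472.1mb
"""

for name, size in parse_cat_indices(sample):
    print(f"{name}: {size / 1024**2:.1f} MiB")
```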

Kibana error (after increasing the timeout from 30000ms to 90000ms):
Kibana Discover: Request Timeout after 90000ms

master-0 error:
{"type": "server", "timestamp": "2019-09-09T10:15:55,342+0000", "level": "ERROR", "component": "o.e.x.m.c.c.ClusterStatsCollector", "": "elasticsearch", "": "elasticsearch-master-1", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "": "n1dOuodeSNaouzeqxlSbAw", "message": "collector [cluster_stats] timed out when collecting data" }

master-1 error:
{"type": "server", "timestamp": "2019-09-09T10:36:15,063+0000", "level": "WARN", "component": "o.e.c.InternalClusterInfoService", "": "elasticsearch", "": "elasticsearch-master-1", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "": "n1dOuodeSNaouzeqxlSbAw", "message": "Failed to update shard information for ClusterInfoUpdateJob within 15s timeout" }
{"type": "server", "timestamp": "2019-09-09T10:37:06,171+0000", "level": "WARN", "component": "o.e.c.InternalClusterInfoService", "": "elasticsearch", "": "elasticsearch-master-1", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "": "n1dOuodeSNaouzeqxlSbAw", "message": "Failed to update shard information for ClusterInfoUpdateJob within 15s timeout" }
{"type": "server", "timestamp": "2019-09-09T10:37:45,406+0000", "level": "ERROR", "component": "o.e.x.m.c.c.ClusterStatsCollector", "": "elasticsearch", "": "elasticsearch-master-1", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "": "n1dOuodeSNaouzeqxlSbAw", "message": "collector [cluster_stats] timed out when collecting data" }
{"type": "server", "timestamp": "2019-09-09T10:37:51,473+0000", "level": "WARN", "component": "o.e.c.InternalClusterInfoService", "": "elasticsearch", "": "elasticsearch-master-1", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "": "n1dOuodeSNaouzeqxlSbAw", "message": "Failed to update shard information for ClusterInfoUpdateJob within 15s timeout" }
{"type": "server", "timestamp": "2019-09-09T10:37:55,411+0000", "level": "ERROR", "component": "o.e.x.m.c.i.IndexStatsCollector", "": "elasticsearch", "": "elasticsearch-master-1", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "": "n1dOuodeSNaouzeqxlSbAw", "message": "collector [index-stats] timed out when collecting data" }
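
To make the pattern in these entries easier to see, they can be tallied by level and component. A small Python sketch, assuming one JSON object per log line; the entries are trimmed copies of the master-1 log above (only the fields used here are kept) and tally is a hypothetical helper name:

```python
import json
from collections import Counter

# Trimmed copies of the five master-1 entries from the question.
log_lines = [
    '{"timestamp": "2019-09-09T10:36:15,063+0000", "level": "WARN", "component": "o.e.c.InternalClusterInfoService"}',
    '{"timestamp": "2019-09-09T10:37:06,171+0000", "level": "WARN", "component": "o.e.c.InternalClusterInfoService"}',
    '{"timestamp": "2019-09-09T10:37:45,406+0000", "level": "ERROR", "component": "o.e.x.m.c.c.ClusterStatsCollector"}',
    '{"timestamp": "2019-09-09T10:37:51,473+0000", "level": "WARN", "component": "o.e.c.InternalClusterInfoService"}',
    '{"timestamp": "2019-09-09T10:37:55,411+0000", "level": "ERROR", "component": "o.e.x.m.c.i.IndexStatsCollector"}',
]


def tally(lines):
    """Count log entries per (level, component) pair."""
    counts = Counter()
    for line in lines:
        entry = json.loads(line)
        counts[(entry["level"], entry["component"])] += 1
    return counts


for (level, component), count in tally(log_lines).most_common():
    print(level, component, count)
```

In this excerpt the 15s shard-information timeout appears three times within about two minutes, alongside one timeout each from the cluster-stats and index-stats collectors.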

One more error after some time:
"Caused by: org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];",
"at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException( ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions( ~[elasticsearch-7.2.0.jar:7.2.0]",
"... 10 more"] }
{"type": "server", "timestamp": "2019-09-09T10:43:43,439+0000", "level": "WARN", "component": "r.suppressed", "": "elasticsearch", "": "elasticsearch-master-0", "cluster.uuid": "m-aesYQJRBa1mOPx19WQUg", "": "bf9AAUgOQuufbPv_DqhPKg", "message": "path: /_monitoring/bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}" ,
"stacktrace": ["org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];",
"at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException( ~[elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException( ~[elasticsearch-7.2.0.jar:7.2.0]",

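The [SERVICE_UNAVAILABLE/2/no master] block suggests that the node could not see an elected master at that moment. Possibly relevant: the health output shows two nodes, and the chart values below set replicas: 2 with the master role on both. The following Python sketch is plain majority arithmetic only, not Elasticsearch's actual election logic (7.x manages its own voting configuration), and the function names are made up for illustration:

```python
def quorum(master_eligible_nodes):
    """Smallest strict majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes):
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


for nodes in (2, 3):
    print(
        f"{nodes} master-eligible nodes: quorum {quorum(nodes)}, "
        f"tolerates {tolerated_failures(nodes)} failure(s)"
    )
```

Under this simple model a two-node group has no headroom: if either node stalls or restarts, a majority of two is no longer reachable, whereas three nodes can lose one.
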
Elasticsearch Helm chart values used:

clusterName: "elasticsearch"
nodeGroup: "master"

# The service that non master groups will try to connect to when joining the cluster
# This should be set to clusterName + "-" + nodeGroup for your master group
masterService: ""

# Elasticsearch roles that will be applied to this nodeGroup
# These will be set as environment variables. E.g. node.master=true
roles:
  master: "true"
  ingest: "true"
  data: "true"

replicas: 2
minimumMasterNodes: 2

esMajorVersion: ""

esConfig: {}

extraEnvs:
   - name: ES_JAVA_OPTS
     value: "-Xms2g -Xmx2g"
     value: obfuscated
     value: obfuscated
   - name:
     value: "true"

secretMounts: []

image: ""
imageTag: "7.2.0"
imagePullPolicy: "IfNotPresent"

podAnnotations: {}
  # es-cluster

# additionals labels
labels: {}

resources:
  requests:
    cpu: "100m"
    memory: "4Gi"
  limits:
    cpu: "1000m"
    memory: "4Gi"

initResources: {}
  # limits:
  #   cpu: "25m"
  #   # memory: "128Mi"
  # requests:
  #   cpu: "25m"
  #   memory: "128Mi"

sidecarResources: {}
  # limits:
  #   cpu: "25m"
  #   # memory: "128Mi"
  # requests:
  #   cpu: "25m"
  #   memory: "128Mi"

networkHost: ""

volumeClaimTemplate:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 10Gi

persistence:
  enabled: true
  annotations: {}

antiAffinityTopologyKey: ""

# Hard means that by default pods will only be scheduled if there are enough nodes for them
# and that they will never end up on the same node. Setting this to soft will do this "best effort"
antiAffinity: "soft"

# This is the node affinity settings as defined in
nodeAffinity: {}

# The default is to deploy all pods serially. By setting this to parallel all pods are started at
# the same time when bootstrapping the cluster
podManagementPolicy: "Parallel"

protocol: http
httpPort: 9200
transportPort: 9300

service:
  type: ClusterIP
  annotations: {}

updateStrategy: RollingUpdate

# This is the max unavailable setting for the pod disruption budget
# The default value of 1 will make sure that kubernetes won't allow more than 1
# of your pods to be unavailable during maintenance
maxUnavailable: 1

podSecurityContext:
  fsGroup: 1000

# The following value is deprecated,
# please use the above podSecurityContext.fsGroup instead
fsGroup: ""

securityContext:
  capabilities:
    drop:
    - ALL
  # readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

# How long to wait for elasticsearch to stop gracefully
terminationGracePeriod: 120

sysctlVmMaxMapCount: 262144

readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 3
  timeoutSeconds: 5

# wait_for_status
clusterHealthCheckParams: "wait_for_status=green&timeout=1s"

  name: master
  exposeHttp: false
  replicas: 3
  heapSize: "512m"
  # additionalJavaOpts: "-XX:MaxRAM=512m"
    enabled: true
    accessMode: ReadWriteOnce
    name: data
    size: "4Gi"
    # storageClass: "ssd"
      path: /_cluster/health?local=true
      port: 9200
    initialDelaySeconds: 5
  antiAffinity: "soft"

      cpu: "1"
      # memory: "1024Mi"
      cpu: "25m"
      memory: "512Mi"

  enabled: true

Is an index size of ~700 MB large enough to cause timeouts? I noticed the problem occurs with indices larger than 500 MB. What should I tune? Checking this

