Cluster status turns yellow every morning at 8:00

ES version: 6.3.0

I fetched the pending tasks and thread pool status; here is the output:

{
    "tasks": [
        {
            "insert_order": 327314,
            "priority": "URGENT",
            "source": "create-index [.monitoring-kibana-6-2021.03.23], cause [auto(bulk api)]",
            "executing": true,
            "time_in_queue_millis": 1384,
            "time_in_queue": "1.3s"
        },
        {
            "insert_order": 327315,
            "priority": "URGENT",
            "source": "shard-started StartedShardEntry{shardId [[.monitoring-kibana-6-2021.03.23][0]], allocationId [_hIkEbnkSBejLOv3a55usw], message [after new shard recovery]}",
            "executing": false,
            "time_in_queue_millis": 1109,
            "time_in_queue": "1.1s"
        },
        {
            "insert_order": 327316,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 923,
            "time_in_queue": "923ms"
        },
        {
            "insert_order": 327338,
            "priority": "URGENT",
            "source": "install-token-metadata",
            "executing": false,
            "time_in_queue_millis": 382,
            "time_in_queue": "382ms"
        },
        {
            "insert_order": 327318,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 888,
            "time_in_queue": "888ms"
        },
        {
            "insert_order": 327319,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 863,
            "time_in_queue": "863ms"
        },
        {
            "insert_order": 327339,
            "priority": "URGENT",
            "source": "update-settings",
            "executing": false,
            "time_in_queue_millis": 379,
            "time_in_queue": "379ms"
        },
        {
            "insert_order": 327321,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 798,
            "time_in_queue": "798ms"
        },
        {
            "insert_order": 327322,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 760,
            "time_in_queue": "760ms"
        },
        {
            "insert_order": 327323,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 749,
            "time_in_queue": "749ms"
        },
        {
            "insert_order": 327324,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 654,
            "time_in_queue": "654ms"
        },
        {
            "insert_order": 327325,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 651,
            "time_in_queue": "651ms"
        },
        {
            "insert_order": 327317,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 918,
            "time_in_queue": "918ms"
        },
        {
            "insert_order": 327327,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 642,
            "time_in_queue": "642ms"
        },
        {
            "insert_order": 327328,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 624,
            "time_in_queue": "624ms"
        },
        {
            "insert_order": 327329,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 585,
            "time_in_queue": "585ms"
        },
        {
            "insert_order": 327330,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 584,
            "time_in_queue": "584ms"
        },
        {
            "insert_order": 327331,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 584,
            "time_in_queue": "584ms"
        },
        {
            "insert_order": 327332,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 578,
            "time_in_queue": "578ms"
        },
        {
            "insert_order": 327333,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 575,
            "time_in_queue": "575ms"
        },
        {
            "insert_order": 327334,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 564,
            "time_in_queue": "564ms"
        },
        {
            "insert_order": 327335,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 549,
            "time_in_queue": "549ms"
        },
        {
            "insert_order": 327336,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 526,
            "time_in_queue": "526ms"
        },
        {
            "insert_order": 327337,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 421,
            "time_in_queue": "421ms"
        },
        {
            "insert_order": 327326,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 650,
            "time_in_queue": "650ms"
        },
        {
            "insert_order": 327320,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 851,
            "time_in_queue": "851ms"
        },
        {
            "insert_order": 327340,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 265,
            "time_in_queue": "265ms"
        }
    ]
}
node_name name   active queue rejected
m-21-58   search      0     1    31911
m-21-58   write      20    32    59787
m-21-59   search      1     0  1831283
m-21-59   write      20   318    12438
m-21-60   search      2     0  1239048
m-21-60   write      20   309    11588
m-21-61   search      0     0  4365153
m-21-61   write      20   381     6550
m-21-62   search      2     0    63790
m-21-62   write      20   134    16337
m-21-63   search      2     0  1946236
m-21-63   write       0     0    12383

The write pool size is 20, and all threads were active and blocked at that point. The index .monitoring-kibana-6-2021.03.23 is not large, so what is wrong?

What timezone are you in? It could be other indices being auto-created at that same time, causing a bit of a backlog.

I am in China, UTC+8.

That is my guess as well. Do I need to do anything about it?

And if it is timezone-related, why is the index created in the morning rather than at midnight?

Daily indices are created based on UTC timezone. Creation should be quick so the fact that it is taking a very long time in your cluster is concerning. Can you provide the full output of the cluster stats API?
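The 8:00 pattern lines up exactly with UTC midnight; a quick sketch of the conversion (using the 2021.03.23 date from the index name above):

```python
from datetime import datetime, timedelta, timezone

# Midnight UTC, when daily indices such as .monitoring-kibana-6-2021.03.23 roll over
utc_midnight = datetime(2021, 3, 23, 0, 0, tzinfo=timezone.utc)

# China Standard Time is UTC+8
cst = timezone(timedelta(hours=8))
local = utc_midnight.astimezone(cst)
print(local.strftime("%H:%M"))  # 08:00
```

So a burst of daily index creation at 00:00 UTC shows up as an 08:00 event in China.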

I have many daily indices, including the nginx logs.

How quick do you mean? I wrote a script that checks the cluster status every 10 minutes.
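For reference, a minimal sketch of such a check, parsing `_cat/thread_pool` text output in the column order shown in the table above (the function names and the queue threshold are hypothetical choices, not from any ES client library):

```python
def parse_thread_pool(cat_output: str):
    """Parse `_cat/thread_pool` text output into a list of dicts.

    Assumes the column order node_name, name, active, queue, rejected,
    as in the table above; a header line is skipped if present.
    """
    rows = []
    for line in cat_output.strip().splitlines():
        parts = line.split()
        if parts[0] == "node_name":  # skip header row
            continue
        node, pool, active, queue, rejected = parts
        rows.append({
            "node": node,
            "pool": pool,
            "active": int(active),
            "queue": int(queue),
            "rejected": int(rejected),
        })
    return rows

def backed_up(rows, queue_threshold=100):
    """Flag pools with a deep queue, a sign the pool is saturated."""
    return [r for r in rows if r["queue"] >= queue_threshold]

# A few write-pool rows from the output above
sample = """
m-21-59   write      20   318    12438
m-21-60   write      20   309    11588
m-21-61   write      20   381     6550
m-21-62   write      20   134    16337
"""
print([r["node"] for r in backed_up(parse_thread_pool(sample))])
# ['m-21-59', 'm-21-60', 'm-21-61', 'm-21-62']
```

With the sample rows above, all four write pools exceed the threshold, matching the saturation visible in the table.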

And here is the output from the cluster stats API:

{
  "_nodes": {
    "total": 9,
    "successful": 9,
    "failed": 0
  },
  "cluster_name": "moji",
  "timestamp": 1616479549244,
  "status": "green",
  "indices": {
    "count": 496,
    "shards": {
      "total": 4664,
      "primaries": 2332,
      "replication": 1,
      "index": {
        "shards": {
          "min": 2,
          "max": 10,
          "avg": 9.403225806451612
        },
        "primaries": {
          "min": 1,
          "max": 5,
          "avg": 4.701612903225806
        },
        "replication": {
          "min": 1,
          "max": 1,
          "avg": 1
        }
      }
    },
    "docs": {
      "count": 12651977755,
      "deleted": 847545
    },
    "store": {
      "size_in_bytes": 13213327263990
    },
    "fielddata": {
      "memory_size_in_bytes": 11794722192,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 0,
      "total_count": 0,
      "hit_count": 0,
      "miss_count": 0,
      "cache_size": 0,
      "cache_count": 0,
      "evictions": 0
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 64530,
      "memory_in_bytes": 32398106814,
      "terms_memory_in_bytes": 27192193374,
      "stored_fields_memory_in_bytes": 4123461960,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 1009984,
      "points_memory_in_bytes": 998085016,
      "doc_values_memory_in_bytes": 83356480,
      "index_writer_memory_in_bytes": 2384296592,
      "version_map_memory_in_bytes": 0,
      "fixed_bit_set_memory_in_bytes": 5107632,
      "max_unsafe_auto_id_timestamp": 1616469512315,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 9,
      "data": 6,
      "coordinating_only": 0,
      "master": 9,
      "ingest": 9
    },
    "versions": [
      "6.3.0"
    ],
    "os": {
      "available_processors": 180,
      "allocated_processors": 180,
      "names": [
        {
          "name": "Linux",
          "count": 9
        }
      ],
      "mem": {
        "total_in_bytes": 605419995136,
        "free_in_bytes": 68049809408,
        "used_in_bytes": 537370185728,
        "free_percent": 11,
        "used_percent": 89
      }
    },
    "process": {
      "cpu": {
        "percent": 220
      },
      "open_file_descriptors": {
        "min": 835,
        "max": 4572,
        "avg": 3238
      }
    },
    "jvm": {
      "max_uptime_in_millis": 8305961371,
      "versions": [
        {
          "version": "1.8.0_131",
          "vm_name": "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version": "25.131-b11",
          "vm_vendor": "Oracle Corporation",
          "count": 9
        }
      ],
      "mem": {
        "heap_used_in_bytes": 145357680336,
        "heap_max_in_bytes": 260756996096
      },
      "threads": 2935
    },
    "fs": {
      "total_in_bytes": 40695274758144,
      "free_in_bytes": 26690833195008,
      "available_in_bytes": 26690833195008
    },
    "plugins": [
      {
        "name": "prometheus-exporter",
        "version": "6.3.0.0",
        "elasticsearch_version": "6.3.0",
        "java_version": "1.8",
        "description": "Export ElasticSearch metrics to Prometheus",
        "classname": "org.elasticsearch.plugin.prometheus.PrometheusExporterPlugin",
        "extended_plugins": [],
        "has_native_controller": false
      },
      {
        "name": "analysis-ik",
        "version": "6.3.0",
        "elasticsearch_version": "6.3.0",
        "java_version": "1.8",
        "description": "IK Analyzer for Elasticsearch",
        "classname": "org.elasticsearch.plugin.analysis.ik.AnalysisIkPlugin",
        "extended_plugins": [],
        "has_native_controller": false
      }
    ],
    "network_types": {
      "transport_types": {
        "security4": 9
      },
      "http_types": {
        "security4": 9
      }
    }
  }
}

How many daily indices do you have? How long does the state stay yellow?

Apart from the system indices, I have about 100 daily indices, all created by Logstash.

I think the Logstash indices are created at midnight China time.

And is there anything abnormal in the cluster stats output?

The yellow status does not last very long, but the check script happened to catch it at just that moment.
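One thing that can be read straight out of the cluster stats above is the shard density. A back-of-the-envelope calculation (data-node count, shard count, and store size taken from that output):

```python
# Figures from the cluster stats output above
data_nodes = 6
total_shards = 4664
store_bytes = 13213327263990

shards_per_data_node = total_shards / data_nodes
avg_shard_size_gb = store_bytes / total_shards / 1024**3

print(round(shards_per_data_node))   # 777
print(round(avg_shard_size_gb, 1))   # 2.6
```

Roughly 777 shards per data node, averaging about 2.6 GB each, is a lot of small shards; every one of them adds cluster state overhead at allocation time.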

Then you have a lot of indices created at basically the same time every day, which is why they take time to allocate. I would recommend either consolidating many of these indices into fewer, larger ones, so that fewer indices and shards need to be allocated at the same time, or switching to rollover for indices that do not need to be updated. Rollover creates new indices based on age or size, which means they will not all be created at the same time.
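As an illustration of the rollover approach, a minimal setup in ES 6.x might look like this (the nginx-logs names and the condition values are hypothetical examples, not taken from this cluster):

```
PUT /nginx-logs-000001
{
  "aliases": { "nginx-logs-write": {} }
}

POST /nginx-logs-write/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_size": "50gb"
  }
}
```

Writers always target the alias; when a rollover condition is met, a new index is created and the alias moves to it, so index creation is spread out instead of happening for everything at UTC midnight.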

I think we are misunderstanding each other.

The 100 daily nginx log indices are created by Logstash at midnight China time.

But the yellow status appears at 8:00 in the morning. Maybe many system indices are also being created then?

Logstash by default creates index names based on UTC. How are you setting the index name in Logstash? Are you not using the default behavior?

There should not be many system indices so I would not expect those to cause this type of problem.

I changed the time to local time in the Logstash configuration.

But what exactly is causing the blocking?

There are a lot of put-mapping tasks in the queue, which seem to slow down shard creation and allocation. Are you using dynamic mappings? Do you have indices with very large mappings and numbers of fields? All of these require cluster state updates and impact each other, as they need to be processed sequentially.

Note that version 6.3 is very old and I believe there have been improvements in this area since. I would therefore recommend that you upgrade.

I did not manually set any mappings for system indices like .monitoring-kibana-6-2021.03.23; they behave according to the defaults.

Maybe I can look at the mappings and change this behavior somehow?

This is not related to the mappings of the monitoring indices. Are you using dynamic mappings for the indices created by Logstash? Do these have large mappings and/or a large number of fields? Do you have any index templates in place for the Logstash-created indices?

If the large number of indices created by Logstash are constantly updating mappings, this will result in a large number of cluster state updates which will affect the creation of the system indices.
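To check whether the Logstash indices really have small mappings, one could count leaf fields in the `GET <index>/_mapping` output. A hypothetical helper (the function name and sample mapping are illustrative, not from any ES client):

```python
def count_fields(mapping: dict) -> int:
    """Recursively count leaf fields in a mapping's `properties` tree."""
    total = 0
    for field in mapping.get("properties", {}).values():
        if "properties" in field:              # object field: recurse
            total += count_fields(field)
        else:
            total += 1
        # multi-fields (e.g. keyword sub-fields) also add to mapping size
        total += len(field.get("fields", {}))
    return total

# Illustrative mapping fragment with an object field and a multi-field
sample_mapping = {
    "properties": {
        "host": {"type": "keyword"},
        "message": {"type": "text",
                    "fields": {"raw": {"type": "keyword"}}},
        "geo": {"properties": {"lat": {"type": "float"},
                               "lon": {"type": "float"}}},
    }
}
print(count_fields(sample_mapping))  # 5
```

Running this against each daily index would quickly show whether any of them carry far more fields than expected.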

The daily indices have 52 fields, which I do not think is a big number, and I do not use dynamic mappings.

I can view the index templates in Kibana 7.11, but not in Kibana 6.3. I did not modify the index template.

Yes. I remember that some indices might wrongly report a weird status at creation time. Not sure if that is the case here, but it is definitely worth upgrading, at the very least to 6.8.x, or better, to 7.11.2.


It is not easy to upgrade a production cluster with a lot of data; I will check the documentation and evaluate the feasibility.

Thank you all!!!