Cluster status turns yellow every morning at 8:00

ES version: 6.3.0

I fetched the pending tasks and thread pool status; here is the output:

{
    "tasks": [
        {
            "insert_order": 327314,
            "priority": "URGENT",
            "source": "create-index [.monitoring-kibana-6-2021.03.23], cause [auto(bulk api)]",
            "executing": true,
            "time_in_queue_millis": 1384,
            "time_in_queue": "1.3s"
        },
        {
            "insert_order": 327315,
            "priority": "URGENT",
            "source": "shard-started StartedShardEntry{shardId [[.monitoring-kibana-6-2021.03.23][0]], allocationId [_hIkEbnkSBejLOv3a55usw], message [after new shard recovery]}",
            "executing": false,
            "time_in_queue_millis": 1109,
            "time_in_queue": "1.1s"
        },
        {
            "insert_order": 327316,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 923,
            "time_in_queue": "923ms"
        },
        {
            "insert_order": 327338,
            "priority": "URGENT",
            "source": "install-token-metadata",
            "executing": false,
            "time_in_queue_millis": 382,
            "time_in_queue": "382ms"
        },
        {
            "insert_order": 327318,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 888,
            "time_in_queue": "888ms"
        },
        {
            "insert_order": 327319,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 863,
            "time_in_queue": "863ms"
        },
        {
            "insert_order": 327339,
            "priority": "URGENT",
            "source": "update-settings",
            "executing": false,
            "time_in_queue_millis": 379,
            "time_in_queue": "379ms"
        },
        {
            "insert_order": 327321,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 798,
            "time_in_queue": "798ms"
        },
        {
            "insert_order": 327322,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 760,
            "time_in_queue": "760ms"
        },
        {
            "insert_order": 327323,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 749,
            "time_in_queue": "749ms"
        },
        {
            "insert_order": 327324,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 654,
            "time_in_queue": "654ms"
        },
        {
            "insert_order": 327325,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 651,
            "time_in_queue": "651ms"
        },
        {
            "insert_order": 327317,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 918,
            "time_in_queue": "918ms"
        },
        {
            "insert_order": 327327,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 642,
            "time_in_queue": "642ms"
        },
        {
            "insert_order": 327328,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 624,
            "time_in_queue": "624ms"
        },
        {
            "insert_order": 327329,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 585,
            "time_in_queue": "585ms"
        },
        {
            "insert_order": 327330,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 584,
            "time_in_queue": "584ms"
        },
        {
            "insert_order": 327331,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 584,
            "time_in_queue": "584ms"
        },
        {
            "insert_order": 327332,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 578,
            "time_in_queue": "578ms"
        },
        {
            "insert_order": 327333,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 575,
            "time_in_queue": "575ms"
        },
        {
            "insert_order": 327334,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 564,
            "time_in_queue": "564ms"
        },
        {
            "insert_order": 327335,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 549,
            "time_in_queue": "549ms"
        },
        {
            "insert_order": 327336,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 526,
            "time_in_queue": "526ms"
        },
        {
            "insert_order": 327337,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 421,
            "time_in_queue": "421ms"
        },
        {
            "insert_order": 327326,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 650,
            "time_in_queue": "650ms"
        },
        {
            "insert_order": 327320,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 851,
            "time_in_queue": "851ms"
        },
        {
            "insert_order": 327340,
            "priority": "HIGH",
            "source": "put-mapping",
            "executing": false,
            "time_in_queue_millis": 265,
            "time_in_queue": "265ms"
        }
    ]
}
node_name name   active queue rejected
m-21-58   search      0     1    31911
m-21-58   write      20    32    59787
m-21-59   search      1     0  1831283
m-21-59   write      20   318    12438
m-21-60   search      2     0  1239048
m-21-60   write      20   309    11588
m-21-61   search      0     0  4365153
m-21-61   write      20   381     6550
m-21-62   search      2     0    63790
m-21-62   write      20   134    16337
m-21-63   search      2     0  1946236
m-21-63   write       0     0    12383

The write pool size is 20, and all threads were active and blocked at that point. The index .monitoring-kibana-6-2021.03.23 is not large, so what is wrong?

What timezone are you in? It could be other indices being auto-created at that same time, causing a bit of a backlog.

I am in China, UTC+8.

That is my guess as well. Do I need to do anything about it?

And if it is timezone-related, why is the index created in the morning rather than at midnight?

Daily indices are created based on UTC timezone. Creation should be quick so the fact that it is taking a very long time in your cluster is concerning. Can you provide the full output of the cluster stats API?
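The 8:00 pattern lines up exactly with UTC midnight; a quick sketch of the conversion (using the 2021.03.23 date from the index name above):

```python
from datetime import datetime, timedelta, timezone

# Midnight UTC, when daily indices such as .monitoring-kibana-6-2021.03.23 roll over
utc_midnight = datetime(2021, 3, 23, 0, 0, tzinfo=timezone.utc)

# China Standard Time is UTC+8
cst = timezone(timedelta(hours=8))
local = utc_midnight.astimezone(cst)
print(local.strftime("%H:%M"))  # 08:00
```

So a burst of daily index creation at 00:00 UTC shows up as an 08:00 event in China.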

I have many daily indices, including the nginx logs.

How quick do you mean? I wrote a script that checks the cluster status every 10 minutes.
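For reference, a minimal sketch of such a check, parsing `_cat/thread_pool` text output in the column order shown in the table above (the function names and the queue threshold are hypothetical choices, not from any ES client library):

```python
def parse_thread_pool(cat_output: str):
    """Parse `_cat/thread_pool` text output into a list of dicts.

    Assumes the column order node_name, name, active, queue, rejected,
    as in the table above; a header line is skipped if present.
    """
    rows = []
    for line in cat_output.strip().splitlines():
        parts = line.split()
        if parts[0] == "node_name":  # skip header row
            continue
        node, pool, active, queue, rejected = parts
        rows.append({
            "node": node,
            "pool": pool,
            "active": int(active),
            "queue": int(queue),
            "rejected": int(rejected),
        })
    return rows

def backed_up(rows, queue_threshold=100):
    """Flag pools with a deep queue, a sign the pool is saturated."""
    return [r for r in rows if r["queue"] >= queue_threshold]

# A few write-pool rows from the output above
sample = """
m-21-59   write      20   318    12438
m-21-60   write      20   309    11588
m-21-61   write      20   381     6550
m-21-62   write      20   134    16337
"""
print([r["node"] for r in backed_up(parse_thread_pool(sample))])
# ['m-21-59', 'm-21-60', 'm-21-61', 'm-21-62']
```

With the sample rows above, all four write pools exceed the threshold, matching the saturation visible in the table.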

And here is the output from the cluster stats API:

{
  "_nodes": {
    "total": 9,
    "successful": 9,
    "failed": 0
  },
  "cluster_name": "moji",
  "timestamp": 1616479549244,
  "status": "green",
  "indices": {
    "count": 496,
    "shards": {
      "total": 4664,
      "primaries": 2332,
      "replication": 1,
      "index": {
        "shards": {
          "min": 2,
          "max": 10,
          "avg": 9.403225806451612
        },
        "primaries": {
          "min": 1,
          "max": 5,
          "avg": 4.701612903225806
        },
        "replication": {
          "min": 1,
          "max": 1,
          "avg": 1
        }
      }
    },
    "docs": {
      "count": 12651977755,
      "deleted": 847545
    },
    "store": {
      "size_in_bytes": 13213327263990
    },
    "fielddata": {
      "memory_size_in_bytes": 11794722192,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 0,
      "total_count": 0,
      "hit_count": 0,
      "miss_count": 0,
      "cache_size": 0,
      "cache_count": 0,
      "evictions": 0
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 64530,
      "memory_in_bytes": 32398106814,
      "terms_memory_in_bytes": 27192193374,
      "stored_fields_memory_in_bytes": 4123461960,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 1009984,
      "points_memory_in_bytes": 998085016,
      "doc_values_memory_in_bytes": 83356480,
      "index_writer_memory_in_bytes": 2384296592,
      "version_map_memory_in_bytes": 0,
      "fixed_bit_set_memory_in_bytes": 5107632,
      "max_unsafe_auto_id_timestamp": 1616469512315,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 9,
      "data": 6,
      "coordinating_only": 0,
      "master": 9,
      "ingest": 9
    },
    "versions": [
      "6.3.0"
    ],
    "os": {
      "available_processors": 180,
      "allocated_processors": 180,
      "names": [
        {
          "name": "Linux",
          "count": 9
        }
      ],
      "mem": {
        "total_in_bytes": 605419995136,
        "free_in_bytes": 68049809408,
        "used_in_bytes": 537370185728,
        "free_percent": 11,
        "used_percent": 89
      }
    },
    "process": {
      "cpu": {
        "percent": 220
      },
      "open_file_descriptors": {
        "min": 835,
        "max": 4572,
        "avg": 3238
      }
    },
    "jvm": {
      "max_uptime_in_millis": 8305961371,
      "versions": [
        {
          "version": "1.8.0_131",
          "vm_name": "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version": "25.131-b11",
          "vm_vendor": "Oracle Corporation",
          "count": 9
        }
      ],
      "mem": {
        "heap_used_in_bytes": 145357680336,
        "heap_max_in_bytes": 260756996096
      },
      "threads": 2935
    },
    "fs": {
      "total_in_bytes": 40695274758144,
      "free_in_bytes": 26690833195008,
      "available_in_bytes": 26690833195008
    },
    "plugins": [
      {
        "name": "prometheus-exporter",
        "version": "6.3.0.0",
        "elasticsearch_version": "6.3.0",
        "java_version": "1.8",
        "description": "Export ElasticSearch metrics to Prometheus",
        "classname": "org.elasticsearch.plugin.prometheus.PrometheusExporterPlugin",
        "extended_plugins": [],
        "has_native_controller": false
      },
      {
        "name": "analysis-ik",
        "version": "6.3.0",
        "elasticsearch_version": "6.3.0",
        "java_version": "1.8",
        "description": "IK Analyzer for Elasticsearch",
        "classname": "org.elasticsearch.plugin.analysis.ik.AnalysisIkPlugin",
        "extended_plugins": [],
        "has_native_controller": false
      }
    ],
    "network_types": {
      "transport_types": {
        "security4": 9
      },
      "http_types": {
        "security4": 9
      }
    }
  }
}

How many daily indices do you have? How long does the state stay yellow?

Apart from the system indices, I have about 100 daily indices, all created by Logstash.

I think the Logstash indices are created at midnight China time.

And is there anything abnormal in the cluster stats output?

The yellow status does not last very long, but the check script happened to catch it at just that moment.
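One thing that can be read straight out of the cluster stats above is the shard density. A back-of-the-envelope calculation (data-node count, shard count, and store size taken from that output):

```python
# Figures from the cluster stats output above
data_nodes = 6
total_shards = 4664
store_bytes = 13213327263990

shards_per_data_node = total_shards / data_nodes
avg_shard_size_gb = store_bytes / total_shards / 1024**3

print(round(shards_per_data_node))   # 777
print(round(avg_shard_size_gb, 1))   # 2.6
```

Roughly 777 shards per data node, averaging about 2.6 GB each, is a lot of small shards; every one of them adds cluster state overhead at allocation time.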

Then you have a lot of indices created at basically the same time every day, which is why they take time to allocate. I would recommend either consolidating many of these indices into fewer, larger ones, so that fewer indices and shards need to be allocated at the same time, or switching to rollover for indices that do not need to be updated. Rollover creates new indices based on age or size, which means they will not all be created at the same time.
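As an illustration of the rollover approach, a minimal setup in ES 6.x might look like this (the nginx-logs names and the condition values are hypothetical examples, not taken from this cluster):

```
PUT /nginx-logs-000001
{
  "aliases": { "nginx-logs-write": {} }
}

POST /nginx-logs-write/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_size": "50gb"
  }
}
```

Writers always target the alias; when a rollover condition is met, a new index is created and the alias moves to it, so index creation is spread out instead of happening for everything at UTC midnight.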

I think we are misunderstanding each other.

The 100 daily nginx log indices are created by Logstash at midnight China time.

But the yellow status appears at 8:00 in the morning. Maybe many system indices are also being created then?

Logstash by default creates index names based on UTC. How are you setting the index name in Logstash? Are you not using the default behavior?

There should not be many system indices so I would not expect those to cause this type of problem.

I changed the time to local time in the Logstash configuration.

But what exactly is causing the blocking?

There are a lot of put-mapping tasks in the queue, which seem to slow down shard creation and allocation. Are you using dynamic mappings? Do you have indices with very large mappings and numbers of fields? All of these require cluster state updates and impact each other, as they need to be processed sequentially.

Note that version 6.3 is very old and I believe there have been improvements in this area since. I would therefore recommend that you upgrade.

I did not manually set any mappings for system indices like .monitoring-kibana-6-2021.03.23; they behave according to the defaults.

Maybe I can look at the mappings and change this behavior somehow?

This is not related to the mappings of the monitoring indices. Are you using dynamic mappings for the indices created by Logstash? Do these have large mappings and/or a large number of fields? Do you have any index templates in place for the Logstash-created indices?

If the large number of indices created by Logstash are constantly updating mappings, this will result in a large number of cluster state updates which will affect the creation of the system indices.
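To check whether the Logstash indices really have small mappings, one could count leaf fields in the `GET <index>/_mapping` output. A hypothetical helper (the function name and sample mapping are illustrative, not from any ES client):

```python
def count_fields(mapping: dict) -> int:
    """Recursively count leaf fields in a mapping's `properties` tree."""
    total = 0
    for field in mapping.get("properties", {}).values():
        if "properties" in field:              # object field: recurse
            total += count_fields(field)
        else:
            total += 1
        # multi-fields (e.g. keyword sub-fields) also add to mapping size
        total += len(field.get("fields", {}))
    return total

# Illustrative mapping fragment with an object field and a multi-field
sample_mapping = {
    "properties": {
        "host": {"type": "keyword"},
        "message": {"type": "text",
                    "fields": {"raw": {"type": "keyword"}}},
        "geo": {"properties": {"lat": {"type": "float"},
                               "lon": {"type": "float"}}},
    }
}
print(count_fields(sample_mapping))  # 5
```

Running this against each daily index would quickly show whether any of them carry far more fields than expected.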

The daily indices have 52 fields, which I do not think is a big number, and I do not use dynamic mappings.

I can view the index templates in Kibana 7.11, but not in Kibana 6.3. I did not modify the index template.

Yes. I remember that some indices might wrongly report a weird status at creation time. Not sure if that is the case here, but it is definitely worth upgrading, at the very least to 6.8.x, or better, to 7.11.2.


It is not easy to upgrade a production cluster with a lot of data; I will check the documentation and evaluate the feasibility.

Thank you all!!!