Normal for node to be RED after reboot; standalone environment

Hello - first-time poster here and a new ES user. I've been troubleshooting for a couple of days now, and after reading numerous posts and the ES docs, I must say this is pretty amazing stuff and the ES team + community seem quite amazing. Anyway, this is probably a real noob question, but here goes:

In a standalone environment, following a proper shutdown or reboot (using the shutdown or reboot commands), is it expected for all of my indices to be in the red state, thereby putting my node in a red state? I ask because, despite the node being green prior to the reboot, it always comes back red after the reboot.

While ES is recovering, I continuously check the red and green index counts, and after a couple of hours, when the last index goes from red >>> green, the node also goes red >>> green and all seems well. However, in testing recovery operations, I find that after a graceful shutdown/restart, all indices are red again and the whole rebuild operation is repeated. I figured this might be expected behavior due to this being a standalone environment, but I have not yet been able to confirm that this is the case, so I thought I would ask here.
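For reference, I'm tracking the recovery with calls roughly like the following (just watching the overall status and the list of red indices):

GET _cluster/health?filter_path=status,initializing_shards,unassigned_shards

GET _cat/indices?v&health=red&h=health,index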

Thanks in advance for your help.

Update #1

I'm seeing the following error when the node restarts, but even after enabling DEBUG-level logging, none of the messages near this error indicate why a recovery is required. Again, the node begins recovery on its own, and after 90-120 minutes the node health goes green as all indices move to green and shards get assigned.

[2022-04-07T17:01:46,893][WARN ][rest.suppressed          ] path: /_security/role/limited-auditor, params: {name=limited-auditor}
org.elasticsearch.ElasticsearchStatusException: Cluster state has not been recovered yet, cannot write to the [null] index
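For anyone following along, I enabled DEBUG with a settings update along these lines (the specific loggers I bumped were guesswork on my part; the recovery- and gateway-related ones seemed most relevant):

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.gateway": "DEBUG",
    "logger.org.elasticsearch.indices.recovery": "DEBUG"
  }
}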

After searching high and low, I did find one related issue that involves being unable to set a password, but because I haven't seen any evidence of that in our setup, I'm pretty sure that isn't my problem. Aside from that, I haven't been able to find anything that covers why the node health would be red immediately after startup when, literally one minute earlier, just prior to shutdown, it was green.

One last data point: this is running within a Docker container. Not sure that is relevant, but I thought I would mention it anyway.

Totally tearing my hair out on this so thanks in advance for any ideas or feedback.

What is the full output of the cluster stats API? What type of hardware are you deploying on? What type of storage are you using?

Hi @Christian_Dahlqvist,

Here is the platform info followed by the cluster stats:

Hardware: Dell PowerEdge R710 + VMware ESXi 6.5
CPU: 8x vCPU
RAM: 32GB
Swap: 5GB
Storage - OS: 500GB; xfs; spindle drives; no RAID
Storage - data: 1TB; xfs; spindle drives; no RAID


Cluster stats

  • The node state is red because this was taken roughly 1h after the ES restart, while the node was still in the middle of the automatic recovery process. The node typically goes green about 90m after a restart.
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "tyr",
  "cluster_uuid" : "zN5mLhsIQieyGN-o45xaxg",
  "timestamp" : 1649398249043,
  "status" : "red",
  "indices" : {
    "count" : 938,
    "shards" : {
      "total" : 1386,
      "primaries" : 1386,
      "replication" : 0.0,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 5,
          "avg" : 1.4776119402985075
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 1.4776119402985075
        },
        "replication" : {
          "min" : 0.0,
          "max" : 0.0,
          "avg" : 0.0
        }
      }
    },
    "docs" : {
      "count" : 226592182,
      "deleted" : 16922
    },
    "store" : {
      "size_in_bytes" : 229489102682,
      "total_data_set_size_in_bytes" : 229489102682,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 0,
      "total_count" : 0,
      "hit_count" : 0,
      "miss_count" : 0,
      "cache_size" : 0,
      "cache_count" : 0,
      "evictions" : 0
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 2014,
      "memory_in_bytes" : 93979976,
      "terms_memory_in_bytes" : 74136256,
      "stored_fields_memory_in_bytes" : 1041280,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 12202112,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 6600328,
      "index_writer_memory_in_bytes" : 9045172,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 2820064,
      "max_unsafe_auto_id_timestamp" : 1649393236055,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "boolean",
          "count" : 7441,
          "index_count" : 529,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 348,
          "index_count" : 116,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 9889,
          "index_count" : 1425,
          "script_count" : 0
        },
        {
          "name" : "double",
          "count" : 377,
          "index_count" : 29,
          "script_count" : 0
        },
        {
          "name" : "flattened",
          "count" : 1176,
          "index_count" : 118,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 4116,
          "index_count" : 831,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 4807,
          "index_count" : 1409,
          "script_count" : 0
        },
        {
          "name" : "half_float",
          "count" : 7758,
          "index_count" : 1293,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 1505,
          "index_count" : 635,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 8980,
          "index_count" : 1409,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 209041,
          "index_count" : 1424,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 31300,
          "index_count" : 1420,
          "script_count" : 0
        },
        {
          "name" : "match_only_text",
          "count" : 1392,
          "index_count" : 116,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 1281,
          "index_count" : 121,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 199726,
          "index_count" : 1423,
          "script_count" : 0
        },
        {
          "name" : "scaled_float",
          "count" : 116,
          "index_count" : 116,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 276107,
          "index_count" : 1420,
          "script_count" : 0
        },
        {
          "name" : "version",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "wildcard",
          "count" : 1972,
          "index_count" : 116,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [
        {
          "name" : "pattern_replace",
          "count" : 1411,
          "index_count" : 1411
        }
      ],
      "tokenizer_types" : [
        {
          "name" : "path_hierarchy",
          "count" : 1411,
          "index_count" : 1411
        }
      ],
      "filter_types" : [
        {
          "name" : "pattern_capture",
          "count" : 1411,
          "index_count" : 1411
        }
      ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 1411,
          "index_count" : 1411
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "keyword",
          "count" : 1411,
          "index_count" : 1411
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 1411,
          "index_count" : 1411
        },
        {
          "name" : "trim",
          "count" : 1411,
          "index_count" : 1411
        }
      ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "7.13.4",
        "index_count" : 1312,
        "primary_shard_count" : 2032,
        "total_primary_bytes" : 214235421067
      },
      {
        "version" : "7.17.1",
        "index_count" : 124,
        "primary_shard_count" : 124,
        "total_primary_bytes" : 15253681615
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 1,
      "coordinating_only" : 0,
      "data" : 1,
      "data_cold" : 1,
      "data_content" : 1,
      "data_frozen" : 1,
      "data_hot" : 1,
      "data_warm" : 1,
      "ingest" : 1,
      "master" : 1,
      "ml" : 0,
      "remote_cluster_client" : 1,
      "transform" : 1,
      "voting_only" : 0
    },
    "versions" : [
      "7.17.1"
    ],
    "os" : {
      "available_processors" : 8,
      "allocated_processors" : 8,
      "names" : [
        {
          "name" : "Linux",
          "count" : 1
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 20.04.4 LTS",
          "count" : 1
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 1
        }
      ],
      "mem" : {
        "total_in_bytes" : 33566515200,
        "free_in_bytes" : 623423488,
        "used_in_bytes" : 32943091712,
        "free_percent" : 2,
        "used_percent" : 98
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 15
      },
      "open_file_descriptors" : {
        "min" : 3976,
        "max" : 3976,
        "avg" : 3976
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 5080511,
      "versions" : [
        {
          "version" : "17.0.2",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "17.0.2+8",
          "vm_vendor" : "Eclipse Adoptium",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 1
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 4319244448,
        "heap_max_in_bytes" : 5687476224
      },
      "threads" : 197
    },
    "fs" : {
      "total_in_bytes" : 1098970566656,
      "free_in_bytes" : 121396645888,
      "available_in_bytes" : 121396645888
    },
    "plugins" : [
      {
        "name" : "repository-s3",
        "version" : "7.17.1",
        "elasticsearch_version" : "7.17.1",
        "java_version" : "1.8",
        "description" : "The S3 repository plugin adds S3 repositories",
        "classname" : "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false,
        "licensed" : false,
        "type" : "isolated"
      }
    ],
    "network_types" : {
      "transport_types" : {
        "security4" : 1
      },
      "http_types" : {
        "security4" : 1
      }
    },
    "discovery_types" : {
      "single-node" : 1
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "docker",
        "count" : 1
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 509,
      "processor_stats" : {
        "append" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "community_id" : {
          "count" : 42992,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 5570
        },
        "conditional" : {
          "count" : 7956575,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 1826260
        },
        "convert" : {
          "count" : 144771,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 4980
        },
        "csv" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "date" : {
          "count" : 43076,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 4310
        },
        "date_index_name" : {
          "count" : 48257,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 6616
        },
        "dissect" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "dot_expander" : {
          "count" : 198903,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 407
        },
        "foreach" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "geoip" : {
          "count" : 193028,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 10093
        },
        "grok" : {
          "count" : 7728124,
          "failed" : 1,
          "current" : 0,
          "time_in_millis" : 1330184
        },
        "gsub" : {
          "count" : 92,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 3
        },
        "join" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "json" : {
          "count" : 48329,
          "failed" : 92,
          "current" : 0,
          "time_in_millis" : 1102
        },
        "kv" : {
          "count" : 14340,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 43
        },
        "lowercase" : {
          "count" : 48257,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 86
        },
        "pipeline" : {
          "count" : 98425,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 86271
        },
        "remove" : {
          "count" : 11768085,
          "failed" : 5076,
          "current" : 0,
          "time_in_millis" : 23474
        },
        "rename" : {
          "count" : 5304403,
          "failed" : 195683,
          "current" : 0,
          "time_in_millis" : 20411
        },
        "script" : {
          "count" : 7793468,
          "failed" : 22140,
          "current" : 0,
          "time_in_millis" : 153679
        },
        "set" : {
          "count" : 19377253,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 338507
        },
        "split" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "trim" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "uppercase" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "uri_parts" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "urldecode" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "user_agent" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        }
      }
    }
  }
}

You have a lot of very small shards, which can be very inefficient. I would recommend reviewing how you shard data and looking to reduce the number of shards significantly. Look at disk I/O and iowait when you are starting up and make sure disk performance is not the bottleneck. Also make sure you are not using swap.
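As a quick check on the swap point, Elasticsearch can tell you whether memory locking is in effect; something like:

GET _nodes?filter_path=**.mlockall

If that reports false, consider setting bootstrap.memory_lock: true in elasticsearch.yml, or disable swap at the OS level.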

Thanks @Christian_Dahlqvist - I will research the points you have outlined, but it would also be good to understand why disk I/O and small shards would only be an issue at ES startup, especially when the node was 100% healthy less than 1m prior. The node runs great and is healthy at all other times, even under heavy load, so why disk/shards would be an issue only at startup isn't intuitive to a new user like myself.

Also, if disk I/O, small shards, etc. are a problem, should there be relevant log messages that would point to that? Again, I'm not seeing anything along those lines, so I want to make sure I know what to expect.

To answer your original question, yes, it's normal for the cluster health to be RED at startup. Elasticsearch doesn't know what might have happened to its data while it was stopped so it cannot simply jump back to a state in which all shards are started. It must start up each shard from cold, and this involves a certain amount of I/O to perform integrity checks and warm up caches and so on.
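You can watch this happening shard by shard while the node is starting up, with something like:

GET _cat/recovery?v&active_only=true&h=index,shard,type,stage,time,files_percent,bytes_percent

On a restart, the ongoing recoveries will mostly be of type existing_store, i.e. shards being rebuilt from the data already on disk.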

1386 shards is quite a lot for a single node. You say you didn't see any indication that this might be a problem, but I think you must have deliberately relaxed certain safety limits (e.g. cluster.max_shards_per_node) to get to this state. Those limits are designed to warn you that you're heading outside of a recommended setup; by relaxing them you're acknowledging that you accept that you're taking these risks.
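You can see where that limit currently stands on your cluster with something like:

GET _cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node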

That said, this is not a totally daft number of shards, and I am a little surprised to hear it is taking 90 minutes to start up the node. That works out to many seconds per shard on average. It looks like you are running on spinning disks, which are really bad at concurrent random I/O, and Elasticsearch will be starting up to 4 shards concurrently, which is going to involve a lot of concurrent random I/O - exactly the wrong workload for a spinning disk. I expect the simplest way to get a more reasonable startup time would be to replace your data drive with an SSD.
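If you want to experiment before swapping hardware, the startup concurrency I mentioned is controlled by cluster.routing.allocation.node_initial_primaries_recoveries, which defaults to 4. On a spinning disk you could try lowering it to see whether less concurrent random I/O yields a better overall startup time (the value 2 here is just an example):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_initial_primaries_recoveries": 2
  }
}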

Elasticsearch doesn't really have strong opinions about performance; it mostly just does the work it needs to do as fast as the underlying system allows. It will emit some warnings if certain I/O operations are slow enough to indicate that the system is just broken, but I expect your situation isn't bad enough to trigger them. It's up to you to set performance goals and investigate when they are not met.

Hi @DavidTurner - This is excellent feedback - thank you.

Knowing that it is normal for the node to be red at startup helps a lot; at least now I know I shouldn't be chasing down that specific issue. The goal now is to try to reduce the time needed to go from red >>> green.

So, our setup ingests data for several NSM-related services, and the indices are currently rolled over on a daily basis. We also initially had some indices set up with 5 shards per index, but this has recently been changed to 1 shard per index. I have reviewed the Elastic white paper entitled "And the big one said 'Rollover' — Managing Elasticsearch time-based indices efficiently" but have not yet implemented size-based rollover (need to research more here).
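From what I've read so far, a size-based rollover policy via ILM would look roughly like this (the policy name and thresholds below are placeholders I'm still thinking through; I haven't applied anything yet):

PUT _ilm/policy/nsm-logs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}

(My understanding is that the policy then has to be referenced from the index templates via index.lifecycle.name and paired with index.lifecycle.rollover_alias.)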

So, with all that in mind, I have a couple of additional questions:

  • Do you think implementing size-based rollover will be sufficient to speed up my ES node's recovery to green, or are there other things I can dig into?

  • Is there anything out there that discusses the ES startup process in detail? I have read the "ES Definitive Guide" (amazing book), but it seems to cover ES startup at a fairly high level, so anything that gets down more into the weeds would be great.

Again, thanks for your time and assistance on this - very much appreciated.

I expect it'll help, although I can't say how much; only you can judge what "sufficient" means here.

Not that I'm aware of. These things change from version to version and typically don't matter much to end users, so it doesn't really make sense to describe them outside of the code itself.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.