Master not discovered yet

Hello,

I'm currently having some problems with my Elasticsearch cluster, see below:

[2023-04-27T09:41:23,333][INFO ][o.e.c.s.MasterService    ] [kibana-com-2-rz2] node-join[{elastic-cold-com-4-rz2}{3jgTzoQdQJSpQaLaQhlMPg}{jk19BevIQJ2IMg-wzRC0QA}{10.2.6.78}{10.2.6.78:9300}{cdfhrstw}{rz=rz2, xpack.installed=true, storage=hdd, transform.node=true} join existing leader], term: 53709, version: 3765802, delta: **added {{elastic-cold-com-4-rz2}**{3jgTzoQdQJSpQaLaQhlMPg}{jk19BevIQJ2IMg-wzRC0QA}{10.2.6.78}{10.2.6.78:9300}{cdfhrstw}{rz=rz2, xpack.installed=true, storage=hdd, transform.node=true}}
[2023-04-27T09:41:24,598][INFO ][o.e.c.s.ClusterApplierService] [kibana-com-2-rz2] added {{elastic-cold-com-4-rz2}{3jgTzoQdQJSpQaLaQhlMPg}{jk19BevIQJ2IMg-wzRC0QA}{10.2.6.78}{10.2.6.78:9300}{cdfhrstw}{rz=rz2, xpack.installed=true, storage=hdd, transform.node=true}}, term: 53709, version: 3765802, reason: Publication{term=53709, version=3765802}

[2023-04-27T09:41:25,784][INFO ][o.e.c.s.MasterService ] [kibana-com-2-rz2] node-left[{elastic-cold-com-4-rz2}{3jgTzoQdQJSpQaLaQhlMPg}{jk19BevIQJ2IMg-wzRC0QA}{10.2.6.78}{10.2.6.78:9300}{cdfhrstw}{rz=rz2, xpack.installed=true, storage=hdd, transform.node=true} reason: disconnected], term: 53709, version: 3765803, delta: **removed {{elastic-cold-com-4-rz2}**{3jgTzoQdQJSpQaLaQhlMPg}{jk19BevIQJ2IMg-wzRC0QA}{10.2.6.78}{10.2.6.78:9300}{cdfhrstw}{rz=rz2, xpack.installed=true, storage=hdd, transform.node=true}}

Every node gets added and removed again and again.

And another failure message appears:

[2023-04-27T10:28:26,330][WARN ][o.e.c.c.ClusterFormationFailureHelper] [elastic-cold-com-3-rz1] master not discovered yet: have discovered
[{elastic-cold-com-3-rz1}{JivunSuJRfucH1OkjOsxWw}{HCM-ywFkTye_NnoBFtGQiw}{10.1.6.78}{10.1.6.78:9300}{cdfhrstw}{xpack.installed=true, transform.node=true, rz=rz1, storage=hdd},
{kibana-com-2-rz2}{wq_mBeIIQNaM0W91nj1y-g}{aWQjOU1YSIatZkVP_wkSOA}{10.2.6.80}{10.2.6.80:9300}{mr}{rz=rz2, xpack.installed=true, storage=none, transform.node=false},
{kibana-com-3-rz3}{xfxXEWq7Sa21vcBARh0ATg}{RhFC8zF5Reu8soIYlPd40Q}{10.5.2.133}{10.5.2.133:9300}{mr}{rz=rz3, xpack.installed=true, storage=none, transform.node=false},
{kibana-com-1-rz1}{lOA9Bi8SS6i23oeE2giT4w}{e_ASNFHXS6ObS3obRVUNEQ}{10.1.6.80}{10.1.6.80:9300}{mr}{rz=rz1, xpack.installed=true, storage=none, transform.node=false}];
discovery will continue using [10.1.6.80:9300, 10.2.6.80:9300, 10.5.2.133:9300] from hosts providers and
[{kibana-com-2-rz2}{wq_mBeIIQNaM0W91nj1y-g}{aWQjOU1YSIatZkVP_wkSOA}{10.2.6.80}{10.2.6.80:9300}{mr}{rz=rz2, xpack.installed=true, storage=none, transform.node=false},
{kibana-com-3-rz3}{xfxXEWq7Sa21vcBARh0ATg}{RhFC8zF5Reu8soIYlPd40Q}{10.5.2.133}{10.5.2.133:9300}{mr}{rz=rz3, xpack.installed=true, storage=none, transform.node=false},
{kibana-com-1-rz1}{lOA9Bi8SS6i23oeE2giT4w}{e_ASNFHXS6ObS3obRVUNEQ}{10.1.6.80}{10.1.6.80:9300}{mr}{rz=rz1, xpack.installed=true, storage=none, transform.node=false}]
from last-known cluster state; node term 53709, last-accepted version 3766120 in term 53709

Does anyone have an idea?
Thanks a lot.

Best
Florian

What is the output from the _cluster/stats?pretty&human API?

{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "request [/_cluster/stats] contains unrecognized parameter: [human API] -> did you mean [human]?"
}
],
"type" : "illegal_argument_exception",
"reason" : "request [/_cluster/stats] contains unrecognized parameter: [human API] -> did you mean [human]?"
},
"status" : 400
}

It's _cluster/stats?pretty&human.
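As a minimal sketch of how the request could be built programmatically so that nothing extra ends up in the query string: the base URL below is an assumption, and a cluster with security enabled would also need credentials, which are left out here.

```python
from urllib.parse import urlencode


def cluster_stats_url(base_url: str) -> str:
    """Build the cluster stats URL with the `pretty` and `human` flags set."""
    # Passing explicit "true" values is equivalent to the bare ?pretty&human form.
    query = urlencode({"pretty": "true", "human": "true"})
    return f"{base_url.rstrip('/')}/_cluster/stats?{query}"


# "http://localhost:9200" is a placeholder address, not taken from the thread.
print(cluster_stats_url("http://localhost:9200"))
```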

{
  "_nodes" : {
    "total" : 31,
    "successful" : 31,
    "failed" : 0
  },
  "cluster_name" : "elastic-com-c1",
  "cluster_uuid" : "pk3BirT2SB-Z5MAPU5vC8A",
  "timestamp" : 1683011715967,
  "status" : "green",
  "indices" : {
    "count" : 1707,
    "shards" : {
      "total" : 9331,
      "primaries" : 7786,
      "replication" : 0.19843308502440277,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 16,
          "avg" : 5.466315172817809
        },
        "primaries" : {
          "min" : 1,
          "max" : 8,
          "avg" : 4.561218512009373
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.2606912712360867
        }
      }
    },
    "docs" : {
      "count" : 13424758097,
      "deleted" : 11367219
    },
    "store" : {
      "size" : "11.9tb",
      "size_in_bytes" : 13140973966412,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "149.6kb",
      "memory_size_in_bytes" : 153224,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "160.8mb",
      "memory_size_in_bytes" : 168653072,
      "total_count" : 183881131,
      "hit_count" : 2382505,
      "miss_count" : 181498626,
      "cache_size" : 246265,
      "cache_count" : 320696,
      "evictions" : 74431
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 145117,
      "memory" : "5.5gb",
      "memory_in_bytes" : 5995052428,
      "terms_memory" : "5.4gb",
      "terms_memory_in_bytes" : 5844262264,
      "stored_fields_memory" : "74.2mb",
      "stored_fields_memory_in_bytes" : 77878472,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "46.9kb",
      "norms_memory_in_bytes" : 48064,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "69.4mb",
      "doc_values_memory_in_bytes" : 72863628,
      "index_writer_memory" : "1.1gb",
      "index_writer_memory_in_bytes" : 1274642416,
      "version_map_memory" : "2.8mb",
      "version_map_memory_in_bytes" : 2967673,
      "fixed_bit_set" : "14.3mb",
      "fixed_bit_set_memory_in_bytes" : 15091360,
      "max_unsafe_auto_id_timestamp" : 1683010130122,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "boolean",
          "count" : 754,
          "index_count" : 734
        },
        {
          "name" : "date",
          "count" : 1847,
          "index_count" : 1643
        },
        {
          "name" : "float",
          "count" : 50,
          "index_count" : 5
        },
        {
          "name" : "geo_point",
          "count" : 1619,
          "index_count" : 1619
        },
        {
          "name" : "half_float",
          "count" : 3278,
          "index_count" : 1629
        },
        {
          "name" : "integer",
          "count" : 110,
          "index_count" : 5
        },
        {
          "name" : "ip",
          "count" : 1619,
          "index_count" : 1619
        },
        {
          "name" : "keyword",
          "count" : 107229,
          "index_count" : 1643
        },
        {
          "name" : "long",
          "count" : 2123,
          "index_count" : 1248
        },
        {
          "name" : "nested",
          "count" : 27,
          "index_count" : 17
        },
        {
          "name" : "object",
          "count" : 21390,
          "index_count" : 1643
        },
        {
          "name" : "text",
          "count" : 106417,
          "index_count" : 1243
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "7.6.2",
        "index_count" : 4,
        "primary_shard_count" : 4,
        "total_primary_size" : "416.2kb",
        "total_primary_bytes" : 426250
      },
      {
        "version" : "7.7.1",
        "index_count" : 3,
        "primary_shard_count" : 3,
        "total_primary_size" : "15.5mb",
        "total_primary_bytes" : 16265887
      },
      {
        "version" : "7.8.0",
        "index_count" : 24,
        "primary_shard_count" : 24,
        "total_primary_size" : "9.6mb",
        "total_primary_bytes" : 10128348
      },
      {
        "version" : "7.12.0",
        "index_count" : 1676,
        "primary_shard_count" : 7755,
        "total_primary_size" : "10.5tb",
        "total_primary_bytes" : 11601058200057
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 31,
      "coordinating_only" : 0,
      "data" : 28,
      "data_cold" : 28,
      "data_content" : 28,
      "data_frozen" : 28,
      "data_hot" : 28,
      "data_warm" : 28,
      "ingest" : 6,
      "master" : 3,
      "ml" : 0,
      "remote_cluster_client" : 31,
      "transform" : 28,
      "voting_only" : 0
    },
    "versions" : [
      "7.12.0"
    ],
    "os" : {
      "available_processors" : 276,
      "allocated_processors" : 276,
      "names" : [
        {
          "name" : "Linux",
          "count" : 31
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Debian GNU/Linux 10 (buster)",
          "count" : 31
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 31
        }
      ],
      "mem" : {
        "total" : "567.5gb",
        "total_in_bytes" : 609426956288,
        "free" : "42.1gb",
        "free_in_bytes" : 45303988224,
        "used" : "525.3gb",
        "used_in_bytes" : 564122968064,
        "free_percent" : 7,
        "used_percent" : 93
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 389
      },
      "open_file_descriptors" : {
        "min" : 1042,
        "max" : 6196,
        "avg" : 4506
      }
    },
    "jvm" : {
      "max_uptime" : "399d",
      "max_uptime_in_millis" : 34477392936,
      "versions" : [
        {
          "version" : "15.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 31
        }
      ],
      "mem" : {
        "heap_used" : "158.2gb",
        "heap_used_in_bytes" : 169910423920,
        "heap_max" : "296gb",
        "heap_max_in_bytes" : 317827579904
      },
      "threads" : 3449
    },
    "fs" : {
      "total" : "46.9tb",
      "total_in_bytes" : 51567886049280,
      "free" : "33.9tb",
      "free_in_bytes" : 37303499055104,
      "available" : "32tb",
      "available_in_bytes" : 35202149511168
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 31
      },
      "http_types" : {
        "security4" : 31
      }
    },
    "discovery_types" : {
      "zen" : 31
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 31
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

You seem to have a lot of very small shards, and you are running an EOL version. You should reduce your shard count and upgrade as a matter of urgency; that is also likely to help with this situation.


From the stats I can see a few issues in the cluster.

As Mark pointed out, the version you are running (7.12.0 according to the stats) is very old and has been EOL for a long time, so you should look to upgrade as soon as possible.

A lot of your indices do not have any replica configured. This is not necessarily causing any immediate problem but is risky and can result in data loss or some data not being available if a node in the cluster is having problems.
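The replica observation can be checked against the posted stats. This is a rough sketch using only the `shards.total` and `shards.primaries` values from the output above; the function name is made up for illustration.

```python
def replication_ratio(total_shards: int, primary_shards: int) -> tuple[int, float]:
    """Return the number of replica shards and the replicas-per-primary ratio."""
    replica_shards = total_shards - primary_shards
    return replica_shards, replica_shards / primary_shards


# Values copied from the "indices.shards" section of the posted stats.
replicas, ratio = replication_ratio(total_shards=9331, primary_shards=7786)
print(replicas, ratio)
```

A ratio of roughly 0.2 matches the `replication` field in the stats and means that only about one in five primaries has a replica copy.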

It looks like you have only around 10GB of heap per node and a lot of very small shards. The number of shards per node is significantly higher than generally recommended for the version you are on. Handling of small shards was improved in version 8.3, which is yet another reason to upgrade to the latest version.
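To make the heap and shard numbers concrete, here is a back-of-the-envelope sketch based on the posted stats. The figure of 20 shards per GB of heap is the commonly cited rule of thumb for 7.x clusters and is an assumption here, not something stated in this thread. The heap is averaged over all 31 nodes, so the value for data nodes alone may differ.

```python
GIB = 1024 ** 3


def shard_summary(total_shards, data_nodes, total_nodes, heap_max_bytes,
                  store_bytes, shards_per_gib_heap=20):
    """Derive rough per-node figures from cluster-wide stats."""
    heap_per_node_gib = heap_max_bytes / total_nodes / GIB
    return {
        "heap_per_node_gib": heap_per_node_gib,
        "shards_per_data_node": total_shards / data_nodes,
        # Assumed guideline: about 20 shards per GiB of heap (7.x rule of thumb).
        "guideline_shards_per_node": heap_per_node_gib * shards_per_gib_heap,
        "avg_shard_size_gib": store_bytes / total_shards / GIB,
    }


# Values copied from the posted _cluster/stats output.
summary = shard_summary(
    total_shards=9331,
    data_nodes=28,
    total_nodes=31,
    heap_max_bytes=317827579904,
    store_bytes=13140973966412,
)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With these inputs the cluster carries well over 300 shards per data node at an average of a little over 1 GiB per shard, while the assumed guideline would allow fewer than 200 per node.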

It looks like you have 28 data nodes and 3 dedicated master nodes. However, only 6 nodes are configured as ingest nodes, which could lead to imbalances and performance issues if you use a lot of ingest pipelines. I would recommend making all data nodes able to run ingest pipelines.
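The role strings in the log lines from the original post (`cdfhrstw` and `mr`) can be decoded to see which nodes carry the ingest role. The sketch below assumes the single-letter abbreviations used by the cat nodes API in 7.x; the helper name is made up for illustration.

```python
# Role abbreviations as used in the node.role column of the cat nodes API (7.x).
ROLE_ABBREVIATIONS = {
    "c": "data_cold",
    "d": "data",
    "f": "data_frozen",
    "h": "data_hot",
    "i": "ingest",
    "l": "ml",
    "m": "master",
    "r": "remote_cluster_client",
    "s": "data_content",
    "t": "transform",
    "v": "voting_only",
    "w": "data_warm",
}


def decode_roles(abbreviated: str) -> list[str]:
    """Expand a role string such as 'cdfhrstw' into full role names."""
    return [ROLE_ABBREVIATIONS[letter] for letter in abbreviated]


for roles in ("cdfhrstw", "mr"):
    decoded = decode_roles(roles)
    print(roles, decoded, "ingest" in decoded)
```

Neither role string from the logs contains `i`, so the nodes shown there cannot run ingest pipelines.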

In addition to this I would recommend you look at the logs of your dedicated master nodes and check if these are experiencing GC issues or whether it is taking a long time to propagate cluster state updates. This could indicate that your cluster is overloaded.
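One way to start on the master logs is to count how often each node joins and leaves. This is a minimal sketch that assumes the `node-join[{name}...` and `node-left[{name}...` line format shown in the original post. It only picks up the first node named on each line, so batched events would be undercounted, and it does not look for GC messages.

```python
import re
from collections import Counter

# Matches e.g. "node-join[{elastic-cold-com-4-rz2}{...}" and captures event and node name.
EVENT_PATTERN = re.compile(r"(node-join|node-left)\[\{([^}]+)\}")


def count_membership_events(lines):
    """Count node-join and node-left events per node name."""
    counts = Counter()
    for line in lines:
        match = EVENT_PATTERN.search(line)
        if match:
            event, node = match.groups()
            counts[(node, event)] += 1
    return counts


# Shortened versions of the log lines from the original post.
sample = [
    "[2023-04-27T09:41:23,333][INFO ][o.e.c.s.MasterService    ] [kibana-com-2-rz2] "
    "node-join[{elastic-cold-com-4-rz2}{3jgTzoQdQJSpQaLaQhlMPg} join existing leader]",
    "[2023-04-27T09:41:25,784][INFO ][o.e.c.s.MasterService ] [kibana-com-2-rz2] "
    "node-left[{elastic-cold-com-4-rz2}{3jgTzoQdQJSpQaLaQhlMPg} reason: disconnected]",
]
for (node, event), count in sorted(count_membership_events(sample).items()):
    print(node, event, count)
```

A node whose join and leave counts keep climbing together is flapping, which narrows down where to look for network or load problems.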

Thanks a lot for your response.
Does upgrading to version 8 require a paid license, or is it still free?

Just as with version 7.12, there is a free Basic license tier with version 8.7, so there should be no change to licensing terms as far as I know.
