Master_not_discovered_exception

Using version 7.15.2 from the Ubuntu package, I deployed 4 ES nodes across 2 VMs, so 2 nodes for each VM. The node names (node.name) are as follows:

  • es-01-1(9301)
  • es-01-2(9302)
  • es-02-1(9301)
  • es-02-2(9302)

Here the prefix es-# identifies the VM and the suffix -# identifies the node within it.
Each node is given a different transport port, as shown above.
I also configured TLS with xpack security on the basic license.

The two nodes in the first VM seem OK.

But the two in the second VM keep logging the same warning repeatedly:

# from node es-02-1
[2022-04-08T14:06:21,588][WARN ][o.e.c.c.ClusterFormationFailureHelper] [es-02-1] master not discovered or elected yet, an election requires at least 2 nodes with ids from [7BDNjBbYSy2DgDvUdZ95MQ, FlgasqSnRTK3GuJ1EVN8-Q, 8fxmMWPPTHKg118NPnBy7Q], have only discovered non-quorum [{es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{51y8ZOSuQKSQzs7r23-5UA}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}, {es-01-1}{si-Zgi6CSxy8Vj5XnrtkBA}{MPGB5Q-rQNa9sk2ZHEvY0A}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}, {es-01-2}{CwWoO1RlTdGfZsUI8aS4zQ}{OoTtcvbeS7ad6X1pGtbbDw}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}, {es-02-2}{au75-y-xSV6bADfPwnxf1g}{4ID_TFaJQdaQAoR-pW_yIg}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}]; discovery will continue using [192.168.200.10:9301, 192.168.200.10:9302, 192.168.200.20:9302] from hosts providers and [{es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{51y8ZOSuQKSQzs7r23-5UA}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}] from last-known cluster state; node term 29, last-accepted version 1651 in term 29
...

# from node es-02-2
[2022-04-08T14:18:39,458][WARN ][o.e.c.c.ClusterFormationFailureHelper] [es-02-2] master not discovered or elected yet, an election requires at least 2 nodes with ids from [au75-y-xSV6bADfPwnxf1g, 7BDNjBbYSy2DgDvUdZ95MQ, FlgasqSnRTK3GuJ1EVN8-Q], have discovered possible quorum [{es-02-2}{au75-y-xSV6bADfPwnxf1g}{4ID_TFaJQdaQAoR-pW_yIg}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}, {es-01-1}{si-Zgi6CSxy8Vj5XnrtkBA}{MPGB5Q-rQNa9sk2ZHEvY0A}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}, {es-01-2}{CwWoO1RlTdGfZsUI8aS4zQ}{OoTtcvbeS7ad6X1pGtbbDw}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}, {es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{51y8ZOSuQKSQzs7r23-5UA}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}]; discovery will continue using [192.168.200.10:9301, 192.168.200.10:9302, 192.168.200.20:9301] from hosts providers and [{es-02-2}{au75-y-xSV6bADfPwnxf1g}{4ID_TFaJQdaQAoR-pW_yIg}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}] from last-known cluster state; node term 24, last-accepted version 1542 in term 24
...

Why do they think FlgasqSnRTK3GuJ1EVN8-Q (both nodes) and 8fxmMWPPTHKg118NPnBy7Q (only es-02-1) are part of the quorum?
And why are the real nodes, like {es-01-1}{si-Zgi6CSxy8Vj5XnrtkBA} or {es-01-2}{CwWoO1RlTdGfZsUI8aS4zQ}, not counted toward the quorum, or only treated as a possible one?

Is it a split brain?
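To make the comparison concrete, here is a quick sketch I used to pull the two ID sets out of the warning text (the regexes only assume the 7.x message format shown above, where node IDs are the 22-character base64 strings; the warning string is abbreviated, with ephemeral IDs and addresses elided):

```python
import re

# Abbreviated copy of the warning from es-02-1
warning = (
    "an election requires at least 2 nodes with ids from "
    "[7BDNjBbYSy2DgDvUdZ95MQ, FlgasqSnRTK3GuJ1EVN8-Q, 8fxmMWPPTHKg118NPnBy7Q], "
    "have only discovered non-quorum ["
    "{es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{...}, {es-01-1}{si-Zgi6CSxy8Vj5XnrtkBA}{...}, "
    "{es-01-2}{CwWoO1RlTdGfZsUI8aS4zQ}{...}, {es-02-2}{au75-y-xSV6bADfPwnxf1g}{...}]"
)

# The voting configuration: a majority of these IDs must be discovered
required = set(re.search(r"ids from \[([^\]]+)\]", warning).group(1).split(", "))

# Node IDs actually discovered (22-char base64 strings wrapped in braces)
discovered = set(re.findall(r"\{([0-9A-Za-z_-]{22})\}", warning))

print(sorted(required - discovered))
# ['8fxmMWPPTHKg118NPnBy7Q', 'FlgasqSnRTK3GuJ1EVN8-Q']
```

So of the three IDs the election requires, only es-02-1's own ID belongs to a node that was actually discovered; the other two don't match any live node.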

Please could you share a more complete set of logs from all 4 nodes? Make sure that the logs you share cover at least 10 minutes, i.e. the same 10-minute period is included in each log file.

Here they are. Nodes es-01-1 and es-01-2, which are on the same VM, seem OK (the logs pick up after some failed geoip database downloads, which I've omitted, but as you can see the download eventually succeeds):

# es-01-1
[2022-04-07T16:44:45,391][INFO ][o.e.x.s.a.TokenService   ] [crema-es-01-1] refresh keys
[2022-04-07T16:44:45,598][INFO ][o.e.x.s.a.TokenService   ] [crema-es-01-1] refreshed keys
[2022-04-07T16:44:45,746][INFO ][o.e.l.LicenseService     ] [crema-es-01-1] license [f3a86413-7dbc-4c39-b406-98424da0271a] mode [basic] - valid
[2022-04-07T16:44:45,749][INFO ][o.e.x.s.a.Realms         ] [crema-es-01-1] license mode is [basic], currently licensed security realms are [reserved/reserved,file/default_file,native/default_native]
[2022-04-07T16:44:45,750][INFO ][o.e.x.s.s.SecurityStatusChangeListener] [crema-es-01-1] Active license is now [BASIC]; Security is enabled
[2022-04-07T16:44:45,788][INFO ][o.e.h.AbstractHttpServerTransport] [crema-es-01-1] publish_address {192.168.200.10:9201}, bound_addresses {192.168.200.10:9201}
[2022-04-07T16:44:45,790][INFO ][o.e.n.Node               ] [crema-es-01-1] started
[2022-04-07T16:44:46,309][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-01-1] downloading geoip database [GeoLite2-Country.mmdb] to [/tmp/elasticsearch-7825820029223194837/geoip-databases/ck-vVKQIQWiDHQo10QNgbQ/GeoLite2-Country.mmdb.tmp.gz]
[2022-04-07T16:44:46,318][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-01-1] downloading geoip database [GeoLite2-City.mmdb] to [/tmp/elasticsearch-7825820029223194837/geoip-databases/ck-vVKQIQWiDHQo10QNgbQ/GeoLite2-City.mmdb.tmp.gz]
[2022-04-07T16:44:46,322][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-01-1] downloading geoip database [GeoLite2-ASN.mmdb] to [/tmp/elasticsearch-7825820029223194837/geoip-databases/ck-vVKQIQWiDHQo10QNgbQ/GeoLite2-ASN.mmdb.tmp.gz]
[2022-04-07T16:44:47,473][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-01-1] successfully reloaded changed geoip database file [/tmp/elasticsearch-7825820029223194837/geoip-databases/ck-vVKQIQWiDHQo10QNgbQ/GeoLite2-Country.mmdb]
[2022-04-07T16:44:47,679][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-01-1] successfully reloaded changed geoip database file [/tmp/elasticsearch-7825820029223194837/geoip-databases/ck-vVKQIQWiDHQo10QNgbQ/GeoLite2-ASN.mmdb]
[2022-04-07T16:44:53,029][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-01-1] successfully reloaded changed geoip database file [/tmp/elasticsearch-7825820029223194837/geoip-databases/ck-vVKQIQWiDHQo10QNgbQ/GeoLite2-City.mmdb]
# es-01-2
[2022-04-07T16:44:11,673][INFO ][o.e.x.s.a.TokenService   ] [crema-es-01-2] refresh keys
[2022-04-07T16:44:11,882][INFO ][o.e.x.s.a.TokenService   ] [crema-es-01-2] refreshed keys
[2022-04-07T16:44:11,998][INFO ][o.e.l.LicenseService     ] [crema-es-01-2] license [f3a86413-7dbc-4c39-b406-98424da0271a] mode [basic] - valid
[2022-04-07T16:44:12,000][INFO ][o.e.x.s.a.Realms         ] [crema-es-01-2] license mode is [basic], currently licensed security realms are [reserved/reserved,file/default_file,native/default_native]
[2022-04-07T16:44:12,001][INFO ][o.e.x.s.s.SecurityStatusChangeListener] [crema-es-01-2] Active license is now [BASIC]; Security is enabled
[2022-04-07T16:44:12,034][INFO ][o.e.h.AbstractHttpServerTransport] [crema-es-01-2] publish_address {192.168.200.10:9202}, bound_addresses {192.168.200.10:9202}
[2022-04-07T16:44:12,037][INFO ][o.e.n.Node               ] [crema-es-01-2] started
[2022-04-07T16:44:12,376][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-01-2] downloading geoip database [GeoLite2-Country.mmdb] to [/tmp/elasticsearch-15740498085232538048/geoip-databases/HtRj8DfNRdetS6oNf2q70Q/GeoLite2-Country.mmdb.tmp.gz]
[2022-04-07T16:44:12,386][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-01-2] downloading geoip database [GeoLite2-City.mmdb] to [/tmp/elasticsearch-15740498085232538048/geoip-databases/HtRj8DfNRdetS6oNf2q70Q/GeoLite2-City.mmdb.tmp.gz]
[2022-04-07T16:44:12,387][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-01-2] downloading geoip database [GeoLite2-ASN.mmdb] to [/tmp/elasticsearch-15740498085232538048/geoip-databases/HtRj8DfNRdetS6oNf2q70Q/GeoLite2-ASN.mmdb.tmp.gz]
[2022-04-07T16:44:13,238][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-01-2] successfully reloaded changed geoip database file [/tmp/elasticsearch-15740498085232538048/geoip-databases/HtRj8DfNRdetS6oNf2q70Q/GeoLite2-Country.mmdb]
[2022-04-07T16:44:13,376][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-01-2] successfully reloaded changed geoip database file [/tmp/elasticsearch-15740498085232538048/geoip-databases/HtRj8DfNRdetS6oNf2q70Q/GeoLite2-ASN.mmdb]
[2022-04-07T16:44:17,694][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-01-2] successfully reloaded changed geoip database file [/tmp/elasticsearch-15740498085232538048/geoip-databases/HtRj8DfNRdetS6oNf2q70Q/GeoLite2-City.mmdb]
[2022-04-07T16:44:44,647][INFO ][o.e.c.s.ClusterApplierService] [crema-es-01-2] added {{crema-es-01-1}{ck-vVKQIQWiDHQo10QNgbQ}{CP9c3usbTjmlSaqgRJFjew}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}}, term: 17, version: 1131, reason: ApplyCommitRequest{term=17, version=1131, sourceNode={crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{JlZgQA13ROSyki2DS13XDQ}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}{ml.machine_memory=3133280256, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server2, transform.node=true}}

es-02-1 and es-02-2, on the other VM, keep repeating the same warning, as I said; these logs cover more than 10 minutes:

# es-02-1
[2022-04-11T10:53:58,150][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [crema-es-02-1] [controller/17049] [Main.cc@122] controller (64 bit): Version 7.15.2 (Build 65497bb5299534) Copyright (c) 2021 Elasticsearch BV
[2022-04-11T10:53:59,216][INFO ][o.e.x.s.a.Realms         ] [crema-es-02-1] license mode is [trial], currently licensed security realms are [reserved/reserved,file/default_file,native/default_native]
[2022-04-11T10:54:01,243][INFO ][o.e.i.g.LocalDatabases   ] [crema-es-02-1] initialized default databases [[GeoLite2-Country.mmdb, GeoLite2-City.mmdb, GeoLite2-ASN.mmdb]], config databases [[]] and watching [/etc/elasticsearch1/ingest-geoip] for changes
[2022-04-11T10:54:01,257][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-02-1] initialized database registry, using geoip-databases directory [/tmp/elasticsearch-8882953503902859882/geoip-databases/7BDNjBbYSy2DgDvUdZ95MQ]
[2022-04-11T10:54:03,049][INFO ][o.e.t.NettyAllocator     ] [crema-es-02-1] creating NettyAllocator with the following configs: [name=unpooled, suggested_max_allocation_size=1mb, factors={es.unsafe.use_unpooled_allocator=null, g1gc_enabled=true, g1gc_region_size=4mb, heap_size=1gb}]
[2022-04-11T10:54:03,213][INFO ][o.e.d.DiscoveryModule    ] [crema-es-02-1] using discovery type [zen] and seed hosts providers [settings]
[2022-04-11T10:54:04,541][INFO ][o.e.g.DanglingIndicesState] [crema-es-02-1] gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually
[2022-04-11T10:54:05,721][INFO ][o.e.n.Node               ] [crema-es-02-1] initialized
[2022-04-11T10:54:05,738][INFO ][o.e.n.Node               ] [crema-es-02-1] starting ...
[2022-04-11T10:54:05,808][INFO ][o.e.x.s.c.f.PersistentCache] [crema-es-02-1] persistent cache index loaded
[2022-04-11T10:54:06,060][INFO ][o.e.t.TransportService   ] [crema-es-02-1] publish_address {192.168.200.20:9301}, bound_addresses {192.168.200.20:9301}
[2022-04-11T10:54:08,225][INFO ][o.e.b.BootstrapChecks    ] [crema-es-02-1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2022-04-11T10:54:08,235][INFO ][o.e.c.c.Coordinator      ] [crema-es-02-1] cluster UUID [nA5JMB2bQnCu4vTbhb6k2A]
[2022-04-11T10:54:18,276][WARN ][o.e.c.c.ClusterFormationFailureHelper] [crema-es-02-1] master not discovered or elected yet, an election requires at least 2 nodes with ids from [7BDNjBbYSy2DgDvUdZ95MQ, FlgasqSnRTK3GuJ1EVN8-Q, 8fxmMWPPTHKg118NPnBy7Q], have only discovered non-quorum [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{1hGzqDiWT-Gj-_jMBzlEbw}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}, {crema-es-01-1}{HC9ZnmlBTjSOIdHNVyfIbg}{VrCOrxbQQMKyamp__XJw3w}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}, {crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{0UR5n91ORMKCYNpQamD2Lg}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}, {crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{tE22GiiRQGCulGBHxx7aKw}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}]; discovery will continue using [192.168.200.10:9301, 192.168.200.10:9302, 192.168.200.20:9302] from hosts providers and [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{1hGzqDiWT-Gj-_jMBzlEbw}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}] from last-known cluster state; node term 29, last-accepted version 1651 in term 29
...
[2022-04-11T11:11:19,093][WARN ][o.e.c.c.ClusterFormationFailureHelper] [crema-es-02-1] master not discovered or elected yet, an election requires at least 2 nodes with ids from [7BDNjBbYSy2DgDvUdZ95MQ, FlgasqSnRTK3GuJ1EVN8-Q, 8fxmMWPPTHKg118NPnBy7Q], have only discovered non-quorum [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{1hGzqDiWT-Gj-_jMBzlEbw}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}, {crema-es-01-1}{HC9ZnmlBTjSOIdHNVyfIbg}{VrCOrxbQQMKyamp__XJw3w}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}, {crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{0UR5n91ORMKCYNpQamD2Lg}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}, {crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{tE22GiiRQGCulGBHxx7aKw}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}]; discovery will continue using [192.168.200.10:9301, 192.168.200.10:9302, 192.168.200.20:9302] from hosts providers and [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{1hGzqDiWT-Gj-_jMBzlEbw}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}] from last-known cluster state; node term 29, last-accepted version 1651 in term 29
# es-02-2
[2022-04-11T10:54:09,238][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [crema-es-02-2] [controller/17306] [Main.cc@122] controller (64 bit): Version 7.15.2 (Build 65497bb5299534) Copyright (c) 2021 Elasticsearch BV
[2022-04-11T10:54:10,960][INFO ][o.e.x.s.a.Realms         ] [crema-es-02-2] license mode is [trial], currently licensed security realms are [reserved/reserved,file/default_file,native/default_native]
[2022-04-11T10:54:12,272][INFO ][o.e.i.g.LocalDatabases   ] [crema-es-02-2] initialized default databases [[GeoLite2-Country.mmdb, GeoLite2-City.mmdb, GeoLite2-ASN.mmdb]], config databases [[]] and watching [/etc/elasticsearch2/ingest-geoip] for changes
[2022-04-11T10:54:12,278][INFO ][o.e.i.g.DatabaseRegistry ] [crema-es-02-2] initialized database registry, using geoip-databases directory [/tmp/elasticsearch-16860011800677364149/geoip-databases/au75-y-xSV6bADfPwnxf1g]
[2022-04-11T10:54:13,806][INFO ][o.e.t.NettyAllocator     ] [crema-es-02-2] creating NettyAllocator with the following configs: [name=unpooled, suggested_max_allocation_size=1mb, factors={es.unsafe.use_unpooled_allocator=null, g1gc_enabled=true, g1gc_region_size=4mb, heap_size=1gb}]
[2022-04-11T10:54:13,956][INFO ][o.e.d.DiscoveryModule    ] [crema-es-02-2] using discovery type [zen] and seed hosts providers [settings]
[2022-04-11T10:54:14,561][INFO ][o.e.g.DanglingIndicesState] [crema-es-02-2] gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually
[2022-04-11T10:54:15,061][INFO ][o.e.n.Node               ] [crema-es-02-2] initialized
[2022-04-11T10:54:15,062][INFO ][o.e.n.Node               ] [crema-es-02-2] starting ...
[2022-04-11T10:54:15,090][INFO ][o.e.x.s.c.f.PersistentCache] [crema-es-02-2] persistent cache index loaded
[2022-04-11T10:54:15,206][INFO ][o.e.t.TransportService   ] [crema-es-02-2] publish_address {192.168.200.20:9302}, bound_addresses {192.168.200.20:9302}
[2022-04-11T10:54:16,106][INFO ][o.e.b.BootstrapChecks    ] [crema-es-02-2] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2022-04-11T10:54:16,110][INFO ][o.e.c.c.Coordinator      ] [crema-es-02-2] cluster UUID [nA5JMB2bQnCu4vTbhb6k2A]
[2022-04-11T10:54:26,162][WARN ][o.e.c.c.ClusterFormationFailureHelper] [crema-es-02-2] master not discovered or elected yet, an election requires at least 2 nodes with ids from [au75-y-xSV6bADfPwnxf1g, 7BDNjBbYSy2DgDvUdZ95MQ, FlgasqSnRTK3GuJ1EVN8-Q], have discovered possible quorum [{crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{tE22GiiRQGCulGBHxx7aKw}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}, {crema-es-01-1}{HC9ZnmlBTjSOIdHNVyfIbg}{VrCOrxbQQMKyamp__XJw3w}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}, {crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{0UR5n91ORMKCYNpQamD2Lg}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}, {crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{1hGzqDiWT-Gj-_jMBzlEbw}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}]; discovery will continue using [192.168.200.10:9301, 192.168.200.10:9302, 192.168.200.20:9301] from hosts providers and [{crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{tE22GiiRQGCulGBHxx7aKw}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}] from last-known cluster state; node term 26, last-accepted version 1542 in term 24
...
[2022-04-11T11:19:17,690][WARN ][o.e.c.c.ClusterFormationFailureHelper] [crema-es-02-2] master not discovered or elected yet, an election requires at least 2 nodes with ids from [au75-y-xSV6bADfPwnxf1g, 7BDNjBbYSy2DgDvUdZ95MQ, FlgasqSnRTK3GuJ1EVN8-Q], have discovered possible quorum [{crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{tE22GiiRQGCulGBHxx7aKw}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}, {crema-es-01-1}{HC9ZnmlBTjSOIdHNVyfIbg}{VrCOrxbQQMKyamp__XJw3w}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}, {crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{0UR5n91ORMKCYNpQamD2Lg}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}, {crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{1hGzqDiWT-Gj-_jMBzlEbw}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}]; discovery will continue using [192.168.200.10:9301, 192.168.200.10:9302, 192.168.200.20:9301] from hosts providers and [{crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{tE22GiiRQGCulGBHxx7aKw}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}] from last-known cluster state; node term 26, last-accepted version 1542 in term 24

These are not complete logs, and I believe the explanation for your problem is in the messages you are eliding.

What do your node configurations look like?

OK. But the full logs are too large to paste here (about 106M). How can I upload them?

OK, I wasn't expecting them to be so big. I sent you an upload link in a private message.

The configurations are all similar except for the TLS certificates. The first comment in each file is the node name:

# es01-1
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: es-cluster
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: es-01-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
node.attr.server_id: server1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /mnt/elasticsearch/elasticsearch1/data
#
# Path to log files:
#
path.logs: /mnt/elasticsearch/elasticsearch1/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
bootstrap.memory_lock: False
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 192.168.200.10
#
# Set a custom port for HTTP:
#
http.port: 9201
transport.port: 9301
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ['192.168.200.10:9301', '192.168.200.10:9302', '192.168.200.20:9301', '192.168.200.20:9302']
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ['es-01-1', 'es-02-1']
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
gateway.recover_after_nodes: 1
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
action.destructive_requires_name: true
#
# Extra Configuration
cluster.routing.allocation.awareness.attributes: server_id
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.flood_stage: 500mb
cluster.routing.allocation.disk.watermark.high: 1gb
cluster.routing.allocation.disk.watermark.low: 2gb
http.cors.allow-credentials: true
http.cors.allow-headers: X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization
http.cors.allow-origin: '*'
http.cors.enabled: true
reindex.remote.whitelist: 172.20.6.104:9200
xpack.license.self_generated.type: basic
xpack.security.enabled: true
xpack.security.transport.ssl.client_authentication: required
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch1/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch1/elastic-certificates.p12
xpack.security.transport.ssl.verification_mode: certificate

# es01-2
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: es-cluster
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: es-01-2
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
node.attr.server_id: server1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /mnt/elasticsearch/elasticsearch2/data
#
# Path to log files:
#
path.logs: /mnt/elasticsearch/elasticsearch2/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
bootstrap.memory_lock: False
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 192.168.200.10
#
# Set a custom port for HTTP:
#
http.port: 9202
transport.port: 9302
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ['192.168.200.10:9301', '192.168.200.10:9302', '192.168.200.20:9301', '192.168.200.20:9302']
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ['es-01-1', 'es-02-1']
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
gateway.recover_after_nodes: 1
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
action.destructive_requires_name: true
#
# Extra Configuration
cluster.routing.allocation.awareness.attributes: server_id
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.flood_stage: 500mb
cluster.routing.allocation.disk.watermark.high: 1gb
cluster.routing.allocation.disk.watermark.low: 2gb
http.cors.allow-credentials: true
http.cors.allow-headers: X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization
http.cors.allow-origin: '*'
http.cors.enabled: true
reindex.remote.whitelist: 172.20.6.104:9200
xpack.license.self_generated.type: basic
xpack.security.enabled: true
xpack.security.transport.ssl.client_authentication: required
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch2/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch2/elastic-certificates.p12
xpack.security.transport.ssl.verification_mode: certificate
# es02-1
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: es-cluster
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: es-02-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
node.attr.server_id: server2
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /mnt/es/elasticsearch1/data
#
# Path to log files:
#
path.logs: /mnt/es/elasticsearch1/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
bootstrap.memory_lock: False
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 192.168.200.20
#
# Set a custom port for HTTP:
#
http.port: 9201
transport.port: 9301
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ['192.168.200.10:9301', '192.168.200.10:9302', '192.168.200.20:9301', '192.168.200.20:9302']
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ['es-01-1']
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
gateway.recover_after_nodes: 1
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
action.destructive_requires_name: true
#
# Extra Configuration
cluster.routing.allocation.awareness.attributes: server_id
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.flood_stage: 500mb
cluster.routing.allocation.disk.watermark.high: 1gb
cluster.routing.allocation.disk.watermark.low: 2gb
http.cors.allow-credentials: true
http.cors.allow-headers: X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization
http.cors.allow-origin: '*'
http.cors.enabled: true
reindex.remote.whitelist: 172.20.6.104:9200
xpack.license.self_generated.type: basic
xpack.security.enabled: true
xpack.security.transport.ssl.client_authentication: required
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch1/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch1/elastic-certificates.p12
xpack.security.transport.ssl.verification_mode: certificate
# es-02-2
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: es-cluster
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: es-02-2
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
node.attr.server_id: server2
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /mnt/es/elasticsearch2/data
#
# Path to log files:
#
path.logs: /mnt/es/elasticsearch2/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
bootstrap.memory_lock: False
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 192.168.200.20
#
# Set a custom port for HTTP:
#
http.port: 9202
transport.port: 9302
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ['192.168.200.10:9301', '192.168.200.10:9302', '192.168.200.20:9301', '192.168.200.20:9302']
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ['es-01-1']
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
gateway.recover_after_nodes: 1
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
action.destructive_requires_name: true
#
# Extra Configuration
cluster.routing.allocation.awareness.attributes: server_id
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.flood_stage: 500mb
cluster.routing.allocation.disk.watermark.high: 1gb
cluster.routing.allocation.disk.watermark.low: 2gb
http.cors.allow-credentials: true
http.cors.allow-headers: X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization
http.cors.allow-origin: '*'
http.cors.enabled: true
reindex.remote.whitelist: 172.20.6.104:9200
xpack.license.self_generated.type: basic
xpack.security.enabled: true
xpack.security.transport.ssl.client_authentication: required
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch2/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch2/elastic-certificates.p12
xpack.security.transport.ssl.verification_mode: certificate
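(An aside on the Discovery section of this file: `cluster.initial_master_nodes` is only consulted the very first time a brand-new cluster forms, and the reference manual recommends listing every master-eligible node there and removing the setting once the cluster has formed. A sketch of the more usual form, with node names taken from this thread:)

```yaml
# Consulted only on the first-ever startup of a brand-new cluster;
# remove the setting after the cluster has formed successfully.
cluster.initial_master_nodes: ['es-01-1', 'es-01-2', 'es-02-1', 'es-02-2']
```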

I think you sent the es01-1 and es01-2 configs twice; could you send es02-1 and es02-2 as well?


This looks very strange, indicating that something is stuck:

[2022-04-11T15:45:50,194][WARN ][o.e.c.c.ClusterFormationFailureHelper] [crema-es-02-1] master not discovered or elected yet, an election requires at least 2 nodes with ids from [7BDNjBbYSy2DgDvUdZ95MQ, FlgasqSnRTK3GuJ1EVN8-Q, 8fxmMWPPTHKg118NPnBy7Q], have only discovered non-quorum [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{yTpSSR6_SC6u5l7-Q1iczA}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}, {crema-es-01-1}{HC9ZnmlBTjSOIdHNVyfIbg}{Fjjan_J7S6WHRzNeajWVCg}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}, {crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{1UzD_kQPTGyA10lXMDkGmA}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}, {crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{KWtoIIO5TBOvhfymXe5Y7A}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}]; discovery will continue using [192.168.200.10:9301, 192.168.200.10:9302, 192.168.200.20:9302] from hosts providers and [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{yTpSSR6_SC6u5l7-Q1iczA}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}] from last-known cluster state; node term 29, last-accepted version 1651 in term 29
[2022-04-11T15:46:06,340][WARN ][o.e.t.ThreadPool         ] [crema-es-02-1] timer thread slept for [10.7s/10726ms] on absolute clock which is above the warn threshold of [5000ms]
[2022-04-11T15:46:06,356][WARN ][o.e.t.ThreadPool         ] [crema-es-02-1] timer thread slept for [10.7s/10726043030ns] on relative clock which is above the warn threshold of [5000ms]
[2022-04-11T15:46:06,368][WARN ][o.e.c.c.ClusterFormationFailureHelper] [crema-es-02-1] master not discovered or elected yet, an election requires at least 2 nodes with ids from [7BDNjBbYSy2DgDvUdZ95MQ, FlgasqSnRTK3GuJ1EVN8-Q, 8fxmMWPPTHKg118NPnBy7Q], have only discovered non-quorum [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{yTpSSR6_SC6u5l7-Q1iczA}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}, {crema-es-01-1}{HC9ZnmlBTjSOIdHNVyfIbg}{Fjjan_J7S6WHRzNeajWVCg}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}, {crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{1UzD_kQPTGyA10lXMDkGmA}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}, {crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{KWtoIIO5TBOvhfymXe5Y7A}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}]; discovery will continue using [192.168.200.10:9301, 192.168.200.10:9302, 192.168.200.20:9302] from hosts providers and [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{yTpSSR6_SC6u5l7-Q1iczA}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}] from last-known cluster state; node term 29, last-accepted version 1651 in term 29
[2022-04-11T15:46:06,373][WARN ][o.e.t.TransportService   ] [crema-es-02-1] Received response for a request that has timed out, sent [12.3s/12374ms] ago, timed out [0s/0ms] ago, action [internal:discovery/request_peers], node [{crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{1UzD_kQPTGyA10lXMDkGmA}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server1, transform.node=true}], id [970]
[2022-04-11T15:46:06,768][WARN ][o.e.t.OutboundHandler    ] [crema-es-02-1] sending transport message [Response{1006}{false}{false}{false}{class org.elasticsearch.cluster.coordination.PeersResponse}] of size [1373] on [Netty4TcpChannel{localAddress=/192.168.200.20:9301, remoteAddress=/192.168.200.10:37392, profile=default}] took [11097ms] which is above the warn threshold of [5000ms] with success [true]
[2022-04-11T15:46:06,345][WARN ][o.e.d.PeerFinder         ] [crema-es-02-1] address [192.168.200.10:9302], node [{crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{1UzD_kQPTGyA10lXMDkGmA}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server1, transform.node=true}], requesting [false] peers request failed
org.elasticsearch.transport.ReceiveTimeoutTransportException: [crema-es-01-2][192.168.200.10:9302][internal:discovery/request_peers] request_id [970] timed out after [12374ms]
[2022-04-11T15:46:16,395][WARN ][o.e.c.c.ClusterFormationFailureHelper] [crema-es-02-1] master not discovered or elected yet, an election requires at least 2 nodes with ids from [7BDNjBbYSy2DgDvUdZ95MQ, FlgasqSnRTK3GuJ1EVN8-Q, 8fxmMWPPTHKg118NPnBy7Q], have only discovered non-quorum [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{yTpSSR6_SC6u5l7-Q1iczA}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}, {crema-es-01-1}{HC9ZnmlBTjSOIdHNVyfIbg}{Fjjan_J7S6WHRzNeajWVCg}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}, {crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{1UzD_kQPTGyA10lXMDkGmA}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}, {crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{KWtoIIO5TBOvhfymXe5Y7A}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}]; discovery will continue using [192.168.200.10:9301, 192.168.200.10:9302, 192.168.200.20:9302] from hosts providers and [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{yTpSSR6_SC6u5l7-Q1iczA}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}] from last-known cluster state; node term 29, last-accepted version 1651 in term 29
[2022-04-11T15:46:24,397][WARN ][o.e.d.PeerFinder         ] [crema-es-02-1] address [192.168.200.20:9302], node [{crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{KWtoIIO5TBOvhfymXe5Y7A}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server2, transform.node=true}], requesting [false] peers request failed
org.elasticsearch.transport.ReceiveTimeoutTransportException: [crema-es-02-2][192.168.200.20:9302][internal:discovery/request_peers] request_id [1013] timed out after [4838ms]
[2022-04-11T15:46:24,491][WARN ][o.e.t.TransportService   ] [crema-es-02-1] Received response for a request that has timed out, sent [4.8s/4838ms] ago, timed out [0s/0ms] ago, action [internal:discovery/request_peers], node [{crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{KWtoIIO5TBOvhfymXe5Y7A}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server2, transform.node=true}], id [1013]
[2022-04-11T15:46:26,398][WARN ][o.e.c.c.ClusterFormationFailureHelper] [crema-es-02-1] master not discovered or elected yet, an election requires at least 2 nodes with ids from [7BDNjBbYSy2DgDvUdZ95MQ, FlgasqSnRTK3GuJ1EVN8-Q, 8fxmMWPPTHKg118NPnBy7Q], have only discovered non-quorum [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{yTpSSR6_SC6u5l7-Q1iczA}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}, {crema-es-01-1}{HC9ZnmlBTjSOIdHNVyfIbg}{Fjjan_J7S6WHRzNeajWVCg}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}, {crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{1UzD_kQPTGyA10lXMDkGmA}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}, {crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{KWtoIIO5TBOvhfymXe5Y7A}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}]; discovery will continue using [192.168.200.10:9301, 192.168.200.10:9302, 192.168.200.20:9302] from hosts providers and [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{yTpSSR6_SC6u5l7-Q1iczA}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}] from last-known cluster state; node term 29, last-accepted version 1651 in term 29
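Reading the failure line mechanically shows why the election can never complete: the last-accepted voting configuration names three master-eligible IDs and needs a majority (2) of them, but only one of those IDs appears among the discovered nodes. A throwaway sketch over an abbreviated copy of the line (IDs copied from the logs above):

```shell
# Abbreviated copy of the ClusterFormationFailureHelper line: the required
# voting IDs, then the {name}{id} pairs of the nodes actually discovered.
line='an election requires at least 2 nodes with ids from [7BDNjBbYSy2DgDvUdZ95MQ, FlgasqSnRTK3GuJ1EVN8-Q, 8fxmMWPPTHKg118NPnBy7Q], have only discovered non-quorum [{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ} {crema-es-01-1}{HC9ZnmlBTjSOIdHNVyfIbg} {crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg} {crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}]'

# Pull out the required IDs (the last-accepted voting configuration).
required=$(printf '%s' "$line" | sed 's/.*ids from \[//; s/\].*//' | tr -d ' ' | tr ',' '\n')

# Count how many of them belong to a node that was actually discovered.
found=0
for id in $required; do
  if printf '%s' "$line" | grep -qF "{$id}"; then found=$((found + 1)); fi
done
echo "required quorum: 2, discovered voting IDs: $found"
```

Only es-02-1's own ID matches; the other two required IDs don't correspond to any discovered node, which is why discovery keeps cycling without an election.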

However I was expecting more clues. Could you capture a stack dump of es-02-1 with jstack?

Their sizes exceed the limit again, so I uploaded them to your link: the jstack (OpenJDK 11) outputs for es02-*. I'm also trying to figure this out myself, but it's the first time I've captured thread dumps.

Thanks, they seem to have been captured correctly but don't indicate any problems. Report pending joins in ClusterFormationFailureHelper by DaveCTurner · Pull Request #85635 · elastic/elasticsearch · GitHub adds the details I'd want to see in the logs, but that's obviously no help here. Could you capture stack dumps on all nodes, not just the es02-* ones?

Hmm yep nothing useful there either.

Could you set these settings in elasticsearch.yml on each node, then restart all the nodes, wait for ~10min, then remove these settings and restart them all again?

logger.org.elasticsearch.discovery: TRACE
logger.org.elasticsearch.cluster.coordination: TRACE

Finally, share the resulting logs in the same place.

So you mean 8 logs in total?

  • 4 logs from the 4 nodes while the settings are enabled
  • Then 4 more from the 4 nodes after the settings are disabled

I only need to see the logs while the logger.*: TRACE settings are configured.

The only difference I see is in the es02-* nodes: the probing in the trace logs.

The masterNode=Optional.empty in these log lines indicates that all four nodes are reporting to crema-es-02-1 that they don't have a master:

[2022-04-12T01:44:14,724][TRACE][o.e.d.PeerFinder         ] [crema-es-02-1] address [192.168.200.10:9302], node [{crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{F4c7kMnZQeeyTxDP2_k8xg}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server1, transform.node=true}], requesting [true] received PeersResponse{masterNode=Optional.empty, knownPeers=[{crema-es-01-1}{HC9ZnmlBTjSOIdHNVyfIbg}{f1DPQ30mTXGw-6e0muSyTA}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server1, transform.node=true}, {crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{CHbud40ZTry8eMpAZWQ4JQ}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server2, transform.node=true}, {crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{rrUbg1JzS5SZFUFY1Iqdpg}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server2, transform.node=true}], term=0}
[2022-04-12T01:44:14,725][TRACE][o.e.d.PeerFinder         ] [crema-es-02-1] startProbe(192.168.200.20:9301) not probing local node
[2022-04-12T01:44:14,724][TRACE][o.e.d.PeerFinder         ] [crema-es-02-1] address [192.168.200.10:9301], node [{crema-es-01-1}{HC9ZnmlBTjSOIdHNVyfIbg}{f1DPQ30mTXGw-6e0muSyTA}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server1, transform.node=true}], requesting [true] received PeersResponse{masterNode=Optional.empty, knownPeers=[{crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{CHbud40ZTry8eMpAZWQ4JQ}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server2, transform.node=true}, {crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{F4c7kMnZQeeyTxDP2_k8xg}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server1, transform.node=true}, {crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{rrUbg1JzS5SZFUFY1Iqdpg}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server2, transform.node=true}], term=0}
[2022-04-12T01:44:14,725][TRACE][o.e.d.PeerFinder         ] [crema-es-02-1] address [192.168.200.20:9302], node [{crema-es-02-2}{au75-y-xSV6bADfPwnxf1g}{rrUbg1JzS5SZFUFY1Iqdpg}{192.168.200.20}{192.168.200.20:9302}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server2, transform.node=true}], requesting [true] received PeersResponse{masterNode=Optional.empty, knownPeers=[{crema-es-01-1}{HC9ZnmlBTjSOIdHNVyfIbg}{f1DPQ30mTXGw-6e0muSyTA}{192.168.200.10}{192.168.200.10:9301}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server1, transform.node=true}, {crema-es-01-2}{kCq5b1LvSHiy1DW0LquQmg}{F4c7kMnZQeeyTxDP2_k8xg}{192.168.200.10}{192.168.200.10:9302}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server1, transform.node=true}, {crema-es-02-1}{7BDNjBbYSy2DgDvUdZ95MQ}{CHbud40ZTry8eMpAZWQ4JQ}{192.168.200.20}{192.168.200.20:9301}{cdfhilmrstw}{ml.machine_memory=3661762560, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, server_id=server2, transform.node=true}], term=26}

That means the two es01-* nodes that you think are working OK are different processes from the nodes involved here. I'm not sure what to suggest; I guess you have some kind of virtualisation (e.g. Docker containers) that means you have distinct processes with the same address.
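One way to rule that in or out is to check which process actually owns each transport port on each host: with one service per port there should be exactly one listener (and one PID) on :9301 and one on :9302. I can't run this against your hosts, so below is a self-contained demo with a throwaway listener on an arbitrary port; on the real machines you'd run something like `sudo ss -tlnp 'sport = :9301'` and inspect the PID column:

```shell
# Stand-in listener on an arbitrary free port (19301); on the real hosts
# you would inspect :9301 and :9302 instead.
python3 -m http.server 19301 --bind 127.0.0.1 >/dev/null 2>&1 &
LISTENER=$!
sleep 1

# ss flags: -t TCP, -l listening sockets, -n numeric addresses; add -p
# (with sudo) on a real host to see the owning process for each port.
count=$(ss -tln | grep -c ':19301')
echo "listeners on port 19301: $count"

kill $LISTENER
```

If a port ever shows two listeners, or a PID that isn't the service you expect, that would explain nodes answering with inconsistent identities.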

Yes. The es01-* and es02-* nodes each run as Ubuntu services (not containers) in their respective VMs:

  • es01-1 - 192.168.200.10:9301
  • es01-2 - 192.168.200.10:9302
  • es02-1 - 192.168.200.20:9301
  • es02-2 - 192.168.200.20:9302

Without the TLS configuration (xpack.security.enabled: false and the xpack.security.transport.ssl.* settings removed), they worked.

Hello, may I ask the reason behind such an interesting setup? Why not just use four dedicated VMs and ditch all the custom-port hassle, since you are splitting resources between services anyway? In my experience, master-not-discovered errors come from firewall issues on port 9200 (external) or 9300 (inter-node), a bad election quorum, resource hogging, and so on, so all of this seems to stem from your unconventional setup :thinking:

Why not just use four dedicated VMs and ditch all the custom-port hassle, since you are splitting resources between services anyway?

Because of a lack of computing resources. I have only 2 hosts and am not using VMs on them, so it's 4 nodes across 2 hosts.

firewall issues

It doesn't seem so, since they can connect:

# In 192.168.200.10
$ nc -zv 192.168.200.20 9301
Connection to 192.168.200.20 9301 port [tcp/*] succeeded!
$ nc -zv 192.168.200.10 9302
Connection to 192.168.200.10 9302 port [tcp/*] succeeded!
# In 192.168.200.20
$ nc -zv 192.168.200.10 9301
Connection to 192.168.200.10 9301 port [tcp/*] succeeded!
$ nc -zv 192.168.200.10 9302
Connection to 192.168.200.10 9302 port [tcp/*] succeeded!
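That said, nc only shows that a TCP connection opens; since the transport layer here is TLS, a handshake-level probe is closer to what the nodes actually do. A self-contained sketch against a throwaway local openssl s_server (the port, file paths, and subject are arbitrary; against the real cluster you would point s_client at e.g. 192.168.200.20:9301 and compare the certificate each node presents):

```shell
# Throwaway self-signed certificate and TLS server standing in for a node.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo-key.pem \
  -out /tmp/demo-cert.pem -days 1 -subj '/CN=localhost' 2>/dev/null
openssl s_server -accept 19302 -key /tmp/demo-key.pem \
  -cert /tmp/demo-cert.pem -quiet >/dev/null 2>&1 &
SERVER=$!
sleep 1

# A completed handshake prints the certificate the server presented,
# which also lets you verify each node serves the expected certificate.
certs=$(echo | openssl s_client -connect 127.0.0.1:19302 2>/dev/null \
  | grep -c 'BEGIN CERTIFICATE')
echo "certificates presented: $certs"

kill $SERVER 2>/dev/null || true
```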

Thank you for your concern anyway!