Can't find master nodes after node restart

Barak · August 16, 2020, 12:25pm

After upgrading to 7.7.1, the data nodes can't find the master nodes after a restart of the data node.
I am using the EC2 discovery plugin. On initial startup the node joins the cluster as expected, but once restarted it says it can't discover the master nodes, even though they appear in the list of IPs:

[2020-08-16T10:51:33,380][WARN ][o.e.d.HandshakingTransportAddressConnector] [ip-172-30-1-7.ec2.internal] handshake failed for [connectToRemoteMasterNode[172.30.2.153:9300]]
org.elasticsearch.transport.SendRequestTransportException: [][172.30.2.153:9300][internal:transport/handshake]
        at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:719) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor.sendWithUser(SecurityServerTransportInterceptor.java:162) ~[?:?]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor.access$300(SecurityServerTransportInterceptor.java:53) ~[?:?]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$1.lambda$sendRequest$0(SecurityServerTransportInterceptor.java:114) ~[?:?]
        at org.elasticsearch.xpack.core.security.SecurityContext.executeAsUser(SecurityContext.java:127) ~[?:?]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$1.sendRequest(SecurityServerTransportInterceptor.java:114) ~[?:?]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:621) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.transport.TransportService.handshake(TransportService.java:458) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.transport.TransportService.handshake(TransportService.java:436) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.discovery.HandshakingTransportAddressConnector$1$1.onResponse(HandshakingTransportAddressConnector.java:95) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.discovery.HandshakingTransportAddressConnector$1$1.onResponse(HandshakingTransportAddressConnector.java:88) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.action.ActionListener$4.onResponse(ActionListener.java:163) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:98) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.1.jar:7.7.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.node.NodeClosedException: node closed {ip-172-30-1-7.ec2.internal}{E2NEDQZWQ9Kf6G0pquFvfw}{KIjua5LDT-KpwNEpHt0bYg}{172.30.1.7}{172.30.1.7:9300}{dilrt}{aws_availability_zone=us-east-1b, ml.machine_memory=66715250688, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}
        at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:701) ~[elasticsearch-7.7.1.jar:7.7.1]
        ... 17 more

And then it gets stuck in this cycle:

[2020-08-16T10:52:09,845][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ip-172-30-1-7.ec2.internal] master not discovered yet: have discovered [{ip-172-30-1-7.ec2.internal}{E2NEDQZWQ9Kf6G0pquFvfw}{jj2zLzYHR8GU-JiZk2FScw}{172.17.0.1}{172.17.0.1:9300}{dilrt}{aws_availability_zone=us-east-1b, ml.machine_memory=66715250688, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {ip-172-30-0-123.ec2.internal}{vPn-iGVfQdWH_PCebTHUiQ}{a_Zb72LGSdiT92cjpwPaeg}{172.30.0.123}{172.30.0.123:9300}{lm}{aws_availability_zone=us-east-1a, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {ip-172-30-2-62.ec2.internal}{sMuj5JCHQNKcNfUZdovHfA}{Sj6ihWVYRh6gpDx9yvXwhw}{172.30.2.62}{172.30.2.62:9300}{lm}{aws_availability_zone=us-east-1c, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {ip-172-30-1-9.ec2.internal}{I7bTPetXQxmLz_a7maEWFw}{EGbpYlR7TvucszK5IaauYw}{172.30.1.9}{172.30.1.9:9300}{lm}{aws_availability_zone=us-east-1b, ml.machine_memory=16820563968, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {ip-172-30-2-153.ec2.internal}{5M30QMotQhWEER6NEm_wiw}{NdlEK_tURk6VzwPdUnKSOQ}{172.30.2.153}{172.30.2.153:9300}{lm}{aws_availability_zone=us-east-1c, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}, {ip-172-30-0-170.ec2.internal}{3uWMMAZRRO2e_0aOYpELHA}{OCck7fwqSXuZ6kC4amioZw}{172.30.0.170}{172.30.0.170:9300}{lmr}{aws_availability_zone=us-east-1a, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 172.30.0.123:9300, 172.30.2.62:9300, 172.30.1.9:9300, 172.30.2.153:9300, 172.30.2.214:9300, 172.30.2.243:9300, 172.30.0.4:9300, 172.30.0.170:9300] from hosts providers and [] from last-known cluster state; node term 83, last-accepted version 1086389 in term 83

Any help will be greatly appreciated

DavidTurner · August 16, 2020, 12:36pm

Are you sure the currently-elected master node appears in the list of discovered nodes? Which one is it?

If so, I think there will be some other log messages too indicating that it is failing to join it for some reason. They might not be very frequent. Can you check for them too?

Barak · August 16, 2020, 1:04pm

There's the connection error from the first log I attached.

Also, after setting the log level to debug, this is the output (note that all the master nodes are listed there: using dynamic transport addresses [172.30.0.123:9300, 172.30.2.62:9300, 172.30.1.9:9300, 172.30.2.153:9300, 172.30.2.214:9300, 172.30.2.243:9300, 172.30.0.4:9300, 172.30.0.170:9300]):

[2020-08-16T11:01:44,644][DEBUG][o.a.h.i.c.PoolingHttpClientConnectionManager] [ip-172-30-1-7.ec2.internal] Connection [id: 0][route: {s}->https://ec2.us-east-1.amazonaws.com:443] can be kept alive for 60.0 seconds
[2020-08-16T11:01:44,644][DEBUG][o.a.h.i.c.DefaultManagedHttpClientConnection] [ip-172-30-1-7.ec2.internal] http-outgoing-0: set socket timeout to 0
[2020-08-16T11:01:44,644][DEBUG][o.a.h.i.c.PoolingHttpClientConnectionManager] [ip-172-30-1-7.ec2.internal] Connection released: [id: 0][route: {s}->https://ec2.us-east-1.amazonaws.com:443][total kept alive: 1; route allocated: 1 of 50; total allocated: 1 of 50]
[2020-08-16T11:01:44,647][DEBUG][o.e.d.e.AwsEc2SeedHostsProvider] [ip-172-30-1-7.ec2.internal] using dynamic transport addresses [172.30.0.123:9300, 172.30.2.62:9300, 172.30.1.9:9300, 172.30.2.153:9300, 172.30.2.214:9300, 172.30.2.243:9300, 172.30.0.4:9300, 172.30.0.170:9300]
[2020-08-16T11:01:44,681][DEBUG][o.e.d.PeerFinder ] [ip-172-30-1-7.ec2.internal] Peer{transportAddress=[::1]:9301, discoveryNode=null, peersRequestInFlight=false} connection failed
org.elasticsearch.transport.ConnectTransportException: [[::1]:9301] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:998) ~[elasticsearch-7.7.1.jar:7.7.1]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:198) ~[elasticsearch-7.7.1.jar:7.7.1]
at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.7.1.jar:7.7.1]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2152) ~[?:?]
at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.7.1.jar:7.7.1]
at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /[0:0:0:0:0:0:0:1]:9301
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
at sun.nio.ch.Net.pollConnectNow(Net.java:589) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:839) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]
... 7 more
[2020-08-16T11:01:44,681][DEBUG][o.e.d.PeerFinder ] [ip-172-30-1-7.ec2.internal] Peer{transportAddress=[::1]:9304, discoveryNode=null, peersRequestInFlight=false} connection failed
org.elasticsearch.transport.ConnectTransportException: [[::1]:9304] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:998) ~[elasticsearch-7.7.1.jar:7.7.1]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:198) ~[elasticsearch-7.7.1.jar:7.7.1]
at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.7.1.jar:7.7.1]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2152) ~[?:?]
at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.7.1.jar:7.7.1]
at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /[0:0:0:0:0:0:0:1]:9304
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
at sun.nio.ch.Net.pollConnectNow(Net.java:589) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:839) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]
... 7 more
[2020-08-16T11:01:44,682][DEBUG][o.e.d.PeerFinder ] [ip-172-30-1-7.ec2.internal] Peer{transportAddress=127.0.0.1:9303, discoveryNode=null, peersRequestInFlight=false} connection failed
org.elasticsearch.transport.ConnectTransportException: [127.0.0.1:9303] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:998) ~[elasticsearch-7.7.1.jar:7.7.1]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:198) ~[elasticsearch-7.7.1.jar:7.7.1]
at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.7.1.jar:7.7.1]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2152) ~[?:?]
at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.7.1.jar:7.7.1]
at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9303
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
at sun.nio.ch.Net.pollConnectNow(Net.java:589) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:839) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]
... 7 more

DavidTurner · August 16, 2020, 1:46pm

Nope, that's not helpful at all, that's just saying it can't connect to anything at nonstandard ports like 127.0.0.1:9303 which is hardly surprising. I don't think you need DEBUG logs here.

Similarly the NodeClosedException from the first message is also uninformative, it just means the node was shutting down during the connection.

Go back to the default logging config to get rid of all this junk and look for messages about problems joining the cluster.

DavidTurner · August 16, 2020, 1:52pm

Also, repeating my first question:

What does GET _cat/master return?

Barak · August 16, 2020, 2:37pm

When the node is in the cluster and everything works fine:

curl localhost:9200/_cat/master
3uWMMAZRRO2e_0aOYpELHA 172.30.0.170 172.30.0.170 ip-172-30-0-170.ec2.internal

Then I restarted Elasticsearch:

systemctl restart elasticsearch

Elasticsearch running. Logs:
elasticsearch log · GitHub
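
For anyone wanting to answer the "is the elected master in the discovered list?" question mechanically, here is a minimal Python sketch. It assumes the default four-column `_cat/master` output shown above (id, host, ip, node) and the default transport port 9300. The helper names `parse_cat_master` and `master_in_seed_hosts` are hypothetical, and the values are copied from the posts in this thread.

```python
def parse_cat_master(line):
    """Split one line of default `_cat/master` output into its four columns."""
    node_id, host, ip, name = line.split()
    return {"id": node_id, "host": host, "ip": ip, "node": name}


def master_in_seed_hosts(master, seed_hosts, port=9300):
    """Return True if the master's transport address is among the seed hosts.

    Assumes the master listens on the given transport port (9300 by default).
    """
    return "{}:{}".format(master["ip"], port) in seed_hosts


# Output of `curl localhost:9200/_cat/master`, copied from the post above.
cat_master_line = "3uWMMAZRRO2e_0aOYpELHA 172.30.0.170 172.30.0.170 ip-172-30-0-170.ec2.internal"

# Addresses reported by the EC2 seed hosts provider in the earlier log output.
seed_hosts = [
    "172.30.0.123:9300", "172.30.2.62:9300", "172.30.1.9:9300", "172.30.2.153:9300",
    "172.30.2.214:9300", "172.30.2.243:9300", "172.30.0.4:9300", "172.30.0.170:9300",
]

master = parse_cat_master(cat_master_line)
print(master["node"], master_in_seed_hosts(master, seed_hosts))
```

With the values from this thread the check passes, which suggests discovery itself is finding the master and the problem lies in a later step.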

DavidTurner · August 16, 2020, 4:09pm

Something looks very wrong with your logging config -- there are no INFO messages, for instance. You'll need to see those.

Can we see the logs (including INFO logs) from the master too?

Barak · August 17, 2020, 8:22am

You were right David, I misconfigured the logs.
You can see the data + master logs here:

https://gist.github.com/barakseri1/3f107c3a38b5e5549ce90ac66117769f

data_node

[2020-08-17T08:08:14,479][INFO ][o.e.e.NodeEnvironment    ] [ip-172-30-1-61.ec2.internal] using [1] data paths, mounts [[/var/lib/elasticsearch (/dev/mapper/vg_elastic-lv_elastic)]], net usable_space [886.4gb], net total_space [984.1gb], types [ext4]
[2020-08-17T08:08:14,482][INFO ][o.e.e.NodeEnvironment    ] [ip-172-30-1-61.ec2.internal] heap size [31gb], compressed ordinary object pointers [true]
[2020-08-17T08:08:14,622][INFO ][o.e.n.Node               ] [ip-172-30-1-61.ec2.internal] node name [ip-172-30-1-61.ec2.internal], node ID [heumpdUlSRezT69xwljSgg], cluster name [elasticsearch-prod]
[2020-08-17T08:08:14,624][INFO ][o.e.n.Node               ] [ip-172-30-1-61.ec2.internal] version[7.7.1], pid[24785], build[default/rpm/ad56dce891c901a492bb1ee393f12dfff473a423/2020-05-28T16:30:01.040088Z], OS[Linux/4.14.97-90.72.amzn2.x86_64/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/14.0.1/14.0.1+7]
[2020-08-17T08:08:14,625][INFO ][o.e.n.Node               ] [ip-172-30-1-61.ec2.internal] JVM home [/usr/share/elasticsearch/jdk]
[2020-08-17T08:08:14,625][INFO ][o.e.n.Node               ] [ip-172-30-1-61.ec2.internal] JVM arguments [-Xshare:auto, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -XX:+ShowCodeDetailsInExceptionMessages, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=SPI,COMPAT, -Xms31g, -Xmx31g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch-14489621377398879393, -XX:HeapDumpPath=/var/lib/elasticsearch, -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Djava.locale.providers=COMPAT, -XX:UseAVX=2, -XX:MaxDirectMemorySize=16642998272, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=default, -Des.distribution.type=rpm, -Des.bundled_jdk=true]
[2020-08-17T08:08:16,526][WARN ][c.a.a.p.i.BasicProfileConfigFileLoader] [ip-172-30-1-61.ec2.internal] Unable to load config file null
java.security.AccessControlException: access denied ("java.io.FilePermission" "/nonexistent/.aws/config" "read")
	at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:?]
	at java.security.AccessController.checkPermission(AccessController.java:1036) ~[?:?]

This file has been truncated.

master_node

[2020-08-17T08:07:51,552][INFO ][o.e.c.s.MasterService    ] [ip-172-30-0-170.ec2.internal] node-left[{ip-172-30-1-61.ec2.internal}{heumpdUlSRezT69xwljSgg}{a5VszEDvT7uXNy2aCzcXIg}{172.30.1.61}{172.30.1.61:9300}{dilrt}{aws_availability_zone=us-east-1b, ml.machine_memory=67534430208, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} reason: disconnected], term: 83, version: 1092774, delta: removed {{ip-172-30-1-61.ec2.internal}{heumpdUlSRezT69xwljSgg}{a5VszEDvT7uXNy2aCzcXIg}{172.30.1.61}{172.30.1.61:9300}{dilrt}{aws_availability_zone=us-east-1b, ml.machine_memory=67534430208, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}
[2020-08-17T08:07:51,602][INFO ][o.e.c.s.ClusterApplierService] [ip-172-30-0-170.ec2.internal] removed {{ip-172-30-1-61.ec2.internal}{heumpdUlSRezT69xwljSgg}{a5VszEDvT7uXNy2aCzcXIg}{172.30.1.61}{172.30.1.61:9300}{dilrt}{aws_availability_zone=us-east-1b, ml.machine_memory=67534430208, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}, term: 83, version: 1092774, reason: Publication{term=83, version=1092774}
[2020-08-17T08:07:51,604][INFO ][o.e.c.r.DelayedAllocationService] [ip-172-30-0-170.ec2.internal] scheduling reroute for delayed shards in [945.8ms] (8 delayed shards)
[2020-08-17T08:07:52,550][INFO ][o.e.c.r.DelayedAllocationService] [ip-172-30-0-170.ec2.internal] scheduling reroute for delayed shards in [0s] (8 delayed shards)
[2020-08-17T08:07:52,583][INFO ][o.e.c.r.DelayedAllocationService] [ip-172-30-0-170.ec2.internal] scheduling reroute for delayed shards in [58.9s] (7 delayed shards)
[2020-08-17T08:07:53,933][WARN ][o.e.c.r.a.AllocationService] [ip-172-30-0-170.ec2.internal] [images_restored_index][5] marking unavailable shards as stale: [JV-KNH8TTDmzZPjGmNHlcg]
[2020-08-17T08:07:59,524][WARN ][o.e.c.r.a.AllocationService] [ip-172-30-0-170.ec2.internal] [challenges_prod][0] marking unavailable shards as stale: [ee5wXBXrTuCyqb5dhE-iAw]
[2020-08-17T08:08:06,155][WARN ][o.e.c.r.a.AllocationService] [ip-172-30-0-170.ec2.internal] [images_restored_index][0] marking unavailable shards as stale: [MaRevXhyRNO3OCBW8nkyfA]

[2020-08-17T08:08:51,582][INFO ][o.e.c.r.DelayedAllocationService] [ip-172-30-0-170.ec2.internal] scheduling reroute for delayed shards in [10.9m] (2 delayed shards)

This file has been truncated.

DavidTurner · August 17, 2020, 9:27am

Ah, that's better. Join failures are only logged at INFO level since they're not always a problem, but here they are.

[2020-08-17T08:08:53,617][INFO ][o.e.c.c.JoinHelper       ] [ip-172-30-1-61.ec2.internal] failed to join {ip-172-30-0-170.ec2.internal}{3uWMMAZRRO2e_0aOYpELHA}{OCck7fwqSXuZ6kC4amioZw}{172.30.0.170}{172.30.0.170:9300}{lmr}{aws_availability_zone=us-east-1a, ml.machine_memory=16626966528, ml.max_open_jobs=20, xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={ip-172-30-1-61.ec2.internal}{heumpdUlSRezT69xwljSgg}{3NbMrFIKTRylx4c9z1xnQg}{172.17.0.1}{172.17.0.1:9300}{dilrt}{aws_availability_zone=us-east-1b, ml.machine_memory=67534430208, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=83, optionalJoin=Optional.empty}
org.elasticsearch.transport.RemoteTransportException: [ip-172-30-0-170.ec2.internal][172.30.0.170:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.transport.ConnectTransportException: [ip-172-30-1-61.ec2.internal][172.17.0.1:9300] connect_timeout[30s]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1004) ~[elasticsearch-7.7.1.jar:7.7.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.7.1.jar:7.7.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]

The master node ip-172-30-0-170.ec2.internal cannot connect to the data node ip-172-30-1-61.ec2.internal -- the data node claims its address is 172.17.0.1:9300 but that doesn't match the node name.
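
The mismatch described above can also be spotted with a short script. This is only a sketch: it assumes node names follow the EC2 `ip-a-b-c-d.` pattern seen in this thread, and that the first five brace-delimited fields of a node descriptor are the node name, node ID, ephemeral ID, host, and publish address, as they appear in these logs. The function names are hypothetical.

```python
import re

# First five brace-delimited fields of a node descriptor as printed in the logs:
# {node name}{node ID}{ephemeral ID}{host}{host:port}
DESCRIPTOR = re.compile(r"\{([^{}]*)\}\{([^{}]*)\}\{([^{}]*)\}\{([^{}]*)\}\{([^{}]*)\}")


def ip_from_ec2_name(node_name):
    """Derive the private IP implied by a name like ip-172-30-1-61.ec2.internal."""
    m = re.match(r"ip-(\d+)-(\d+)-(\d+)-(\d+)\.", node_name)
    return ".".join(m.groups()) if m else None


def publish_mismatch(descriptor):
    """Return (name, expected_ip, published_ip) if they disagree, else None."""
    m = DESCRIPTOR.search(descriptor)
    if m is None:
        raise ValueError("not a node descriptor")
    name, _node_id, _ephemeral_id, _host, address = m.groups()
    expected_ip = ip_from_ec2_name(name)
    published_ip = address.rsplit(":", 1)[0]  # strip the transport port
    if expected_ip is not None and expected_ip != published_ip:
        return (name, expected_ip, published_ip)
    return None


# Descriptors copied from the JoinHelper message above (attributes omitted).
join_source = "{ip-172-30-1-61.ec2.internal}{heumpdUlSRezT69xwljSgg}{3NbMrFIKTRylx4c9z1xnQg}{172.17.0.1}{172.17.0.1:9300}{dilrt}"
master_node = "{ip-172-30-0-170.ec2.internal}{3uWMMAZRRO2e_0aOYpELHA}{OCck7fwqSXuZ6kC4amioZw}{172.30.0.170}{172.30.0.170:9300}{lmr}"

print(publish_mismatch(join_source))
print(publish_mismatch(master_node))
```

The data node's descriptor is flagged because its name implies 172.30.1.61 while it publishes 172.17.0.1; the master's descriptor is consistent.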

Barak · August 17, 2020, 10:54am

Thanks David.
It incorrectly published the Docker bridge IP. Shouldn't it pick the eth0 interface by default instead of docker0?

I added network.publish_host: <<private_ip>> to the elasticsearch.yml to fix it.
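
A quick way to recognise this situation in other logs is to test whether a published address falls inside Docker's bridge range. The sketch below assumes Docker's default bridge subnet, 172.17.0.0/16; a host can be configured to use a different range, so treat a match as a hint rather than proof. The function name is hypothetical.

```python
import ipaddress

# Assumption: Docker's default bridge (docker0) subnet. Hosts may override it.
DOCKER_DEFAULT_BRIDGE = ipaddress.ip_network("172.17.0.0/16")


def looks_like_docker_bridge(address):
    """Return True if the address lies in Docker's default bridge subnet."""
    return ipaddress.ip_address(address) in DOCKER_DEFAULT_BRIDGE


print(looks_like_docker_bridge("172.17.0.1"))   # address published by the data node
print(looks_like_docker_bridge("172.30.1.61"))  # address implied by the node name
```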

DavidTurner · August 17, 2020, 11:16am

Are you running these nodes within Docker using a bridge network? The Docker docs say not to do that:

Bridge networks apply to containers running on the same Docker daemon host. For communication among containers running on different Docker daemon hosts, you can either manage routing at the OS level, or you can use an overlay network.

You can use the special value _ec2:privateIpv4_ instead of having to work out what the IP address is yourself.

Barak · August 17, 2020, 12:31pm

No, the nodes are not running in containers, which is why I was curious about it using this network.
I do have Docker running, but only for a container of the Elasticsearch exporter for Prometheus.

DavidTurner · August 17, 2020, 12:36pm

Gotcha. In which case you don't want to set network.publish_host at all, you should set network.host: _ec2:privateIpv4_ instead.

Barak · August 17, 2020, 1:29pm

Noted. Thanks a lot

system · September 14, 2020, 1:29pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
