How is Docker/Docker-Compose getting in the way?

jerrac · April 27, 2020, 11:26pm

I use docker-compose to keep most of my configuration in git for various applications. For the past few days I've been trying to get a test Elasticsearch cluster up and running. After way too much troubleshooting, I've narrowed the issue down to docker or docker-compose.

Basically, if I run Elasticsearch extracted from the tarball, it works just fine. But if I run it via docker-compose, the nodes never get past master discovery.

Here is my elasticsearch.yml file:

cluster.name: mycluster
node.name: testelk01.example.org
node.master: true
http.port: 9200
transport.port: 9300

network.host: _site_
discovery.seed_hosts:
  - testelk01.example.org
  - testelk02.example.org
  - testelk03.example.org
cluster.initial_master_nodes:
  - testelk01.example.org
  - testelk02.example.org
  - testelk03.example.org

#http.cors.enabled: true
#http.cors.allow-origin: "*"
#http.cors.allow-headers: X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization,Access-Control-Allow-Origin
#http.cors.allow-credentials: true

xpack.license.self_generated.type: basic
xpack.ilm.enabled: true
xpack.monitoring.enabled: true

# security settings
xpack.security.enabled: false

#xpack.security.http.ssl.enabled: true
#xpack.security.http.ssl.key: "/usr/share/elasticsearch/config/certs/mycluster_wildcard_example_org.key"
#xpack.security.http.ssl.certificate: "/usr/share/elasticsearch/config/certs/mycluster_wildcard_example_org.crt"
#xpack.security.http.ssl.certificate_authorities:
#  - "/usr/share/elasticsearch/config/certs/DigiCertCA.crt"
#  - "/usr/share/elasticsearch/config/certs/DigiCertTrustedRoot.crt"
#
#
#xpack.security.transport.ssl.enabled: true
#xpack.security.transport.ssl.verification_mode: none
#xpack.security.transport.ssl.key: "/usr/share/elasticsearch/config/certs/mycluster_wildcard_example_org.key"
#xpack.security.transport.ssl.certificate: "/usr/share/elasticsearch/config/certs/mycluster_wildcard_example_org.crt"
#xpack.security.transport.ssl.certificate_authorities:
#  - "/usr/share/elasticsearch/config/certs/DigiCertCA.crt"
#  - "/usr/share/elasticsearch/config/certs/DigiCertTrustedRoot.crt"

path.data: /usr/share/elasticsearch/data
path.logs: /usr/share/elasticsearch/logs

bootstrap.memory_lock: false

My docker-compose file:

version: '3.3'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.6.2
    container_name: testelk01_elasticsearch
    environment:
      - "ES_JAVA_OPTS=-Xms6144m -Xmx6144m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ./config/elasticsearch/testelk01/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
      - ./certs/:/usr/share/elasticsearch/config/certs/
      - /mnt/elasticsearch_iscsi/testelk01/elasticsearch/data:/usr/share/elasticsearch/data
      - /mnt/elasticsearch_iscsi/testelk01/elasticsearch/logs:/usr/share/elasticsearch/logs
    ports:
      - "< internel ip 1 >:9200:9200"
      - "< internel ip 1 >:9300-9400:9300-9400"
    healthcheck:
      test: ["CMD", "curl","-s" ,"-f", "http://localhost:9200/_cat/health"]
    networks:
      - elknet
    extra_hosts:
      - "testelk01.example.org:< internel ip 1 >"
      - "testelk02.example.org:< internel ip 2 >"
      - "testelk03.example.org:< internel ip 3 >"
    restart: always
  kibana:
    container_name: testelk01_kibana
    image: docker.elastic.co/kibana/kibana:7.6.2
    volumes:
      - ./config/kibana/testelk01/kibana.yml:/usr/share/kibana/config/kibana.yml
      - /mnt/elasticsearch_iscsi/testelk01/kibana/data:/usr/share/kibana/data
      - ./certs/:/usr/share/kibana/config/certs/
    ports:
      - 127.0.0.1:5601:5601
    networks:
      - elknet
    extra_hosts:
      - "testelk01.example.org:< internel ip 1 >"
      - "testelk02.example.org:< internel ip 2 >"
      - "testelk03.example.org:< internel ip 3 >"
    restart: always

networks:
  elknet:
    driver: bridge
    driver_opts:
      com.docker.network.bridge.name: elknet

Just adjust the node name to 02 and 03, and you'll have the file for my other two nodes.

I had been trying to set things up with ssl, but I commented all that out while troubleshooting.

To test, I set up a temp directory for data and logs, adjusted my elasticsearch.yml file to use them, and ran this on each node:

ES_PATH_CONF=/srv/elktemp/elasticsearch-7.6.2/config ./elasticsearch

It worked just fine. They discovered each other, and curl'ing the heath said there were 3 nodes and it was green.

But if I take nearly the exact same config and run it via docker-compose, the nodes never find each other.

They eventually just keep repeating this message:

testelk03_elasticsearch | {"type": "server", "timestamp": "2020-04-27T23:03:00,987Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "mycluster", "node.name": "testelk03.example.org", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [testelk01.example.org, testelk02.example.org, testelk03.example.org] to bootstrap a cluster: have discovered [{testelk03.example.org}{eFxpuLZ0RPCbmi1Toiqm9A}{6Ts-bEguRDaVqFq0SV4bog}{172.18.0.2}{172.18.0.2:9300}{dilm}{ml.machine_memory=8364195840, xpack.installed=true, ml.max_open_jobs=20}]; discovery will continue using [< internel ip 1 >:9300, < internel ip 2 >:9300, < internel ip 3 >:9300] from hosts providers and [{testelk03.example.org}{eFxpuLZ0RPCbmi1Toiqm9A}{6Ts-bEguRDaVqFq0SV4bog}{172.18.0.2}{172.18.0.2:9300}{dilm}{ml.machine_memory=8364195840, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }

It shouldn't be a networking or firewall issue. I exec'd into a container, installed nmap, and checked that port 9300 on another node was open. It was. So was 9200.

I even tried turning off the host firewall.

Since it works when I remove Docker from the equation, that must mean Docker is the issue, but I'm really not sure how. If I can communicate with the other containers from within them, then Elasticsearch should be able to as well.

Any ideas?

Thanks in advance.

DavidTurner · April 28, 2020, 5:10am

Looks like a connectivity issue indeed, and I think the fact that you can't determine this from the logs is a bug that's fixed in 7.7.0.

In the log line you shared the node claims to have address 172.18.0.2:9300. Are you sure that the other nodes can communicate with it at that address?

rcowart · April 28, 2020, 7:39am

You probably need to set network.publish_host. From the docs...

The publish host is the single interface that the node advertises to other nodes in the cluster, so that those nodes can connect to it. Currently an Elasticsearch node may be bound to multiple addresses, but only publishes one. If not specified, this defaults to the “best” address from network.host, sorted by IPv4/IPv6 stack preference, then by reachability. If you set a network.host that results in multiple bind addresses yet rely on a specific address for node-to-node communication, you should explicitly set network.publish_host.

Rob

How to install Elasticsearch & Kibana on Ubuntu - incl. hardware recommendations
What is the best storage technology for Elasticsearch?

DavidTurner · April 28, 2020, 7:43am

I disagree. You should only set network.publish_host if you want Elasticsearch to listen for transport connections on multiple interfaces, which doesn't really make sense unless you're using cross-cluster search or replication, or the now-deprecated transport client, and even then only with a reasonably esoteric network configuration.

The setting you're after is network.host.

rcowart · April 28, 2020, 7:54am

You can disagree, but it doesn't make me wrong. Especially when using docker with bridge mode networking.

DavidTurner · April 28, 2020, 8:20am

Indeed, you're not wrong, you can use network.publish_host to resolve this problem too. It's just unnecessarily complicated to use two network settings if you're only listening on a single interface. Better to avoid these more specialised settings where they're not needed.

jerrac · April 28, 2020, 4:38pm

No, they wouldn't be able to. Each container is on it's own host, using it's own docker network. I haven't tried to enable any cross host docker networking.

I'm pretty sure Elasticsearch can't bind to the specific NIC address from within a container. Unless I put the containers network mode into "host".

Would using 0.0.0.0 for network.host help?

The docs say that _site_ is for:

Any site-local addresses on the system, for example 192.168.0.1 .

I was thinking that all the addresses available inside the docker container would be local addresses. But as I think about that, I'm not so sure anymore.

jerrac · April 28, 2020, 5:00pm

Hmm... Nope.

network.host: 0.0.0.0
network.publish_host: 0.0.0.0

Did not help.

I also tried with just

network.host: 0.0.0.0

Looks like it is still publishing on the Docker container network.

"message": "publish_address {172.21.0.2:9300}, bound_addresses {0.0.0.0:9300}"

Why wouldn't Docker's iptable rules route things to the 172.21.0.2:9300? Is publish a broadcast, and Docker doesn't route broadcasts out correctly?

jerrac · April 28, 2020, 5:18pm

Ok, so I tried:

network_mode: host

in the docker-compose service config.

Initially I tried to bind to a specific nic via network.host in elasticsearch.yml. That resulted in an error:

testelk01_elasticsearch | {"type": "server", "timestamp": "2020-04-28T17:06:06,543Z", "level": "ERROR", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "mycluster", "node.name": "testelk01.example.org", "message": "uncaught exception in thread [main]", 
testelk01_elasticsearch | "stacktrace": ["org.elasticsearch.bootstrap.StartupException: BindTransportException[Failed to bind to [9300]]; nested: BindException[Cannot assign requested address];",
testelk01_elasticsearch | "at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:174) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:125) ~[elasticsearch-cli-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "Caused by: org.elasticsearch.transport.BindTransportException: Failed to bind to [9300]",
testelk01_elasticsearch | "at org.elasticsearch.transport.TcpTransport.bindToPort(TcpTransport.java:389) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.transport.TcpTransport.bindServer(TcpTransport.java:355) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | uncaught exception in thread [main]
testelk01_elasticsearch | "at org.elasticsearch.transport.netty4.Netty4Transport.doStart(Netty4Transport.java:135) ~[?:?]",
testelk01_elasticsearch | "at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:59) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.transport.TransportService.doStart(TransportService.java:230) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:59) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.node.Node.start(Node.java:697) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.bootstrap.Bootstrap.start(Bootstrap.java:273) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:358) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170) ~[elasticsearch-7.6.2.jar:7.6.2]",
testelk01_elasticsearch | "... 6 more",
testelk01_elasticsearch | "Caused by: java.net.BindException: Cannot assign requested address",
testelk01_elasticsearch | "at sun.nio.ch.Net.bind0(Native Method) ~[?:?]",
testelk01_elasticsearch | "at sun.nio.ch.Net.bind(Net.java:469) ~[?:?]",
testelk01_elasticsearch | "at sun.nio.ch.Net.bind(Net.java:458) ~[?:?]",
testelk01_elasticsearch | "at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:220) ~[?:?]",
testelk01_elasticsearch | "at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134) ~[?:?]",
testelk01_elasticsearch | "at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:551) ~[?:?]",
testelk01_elasticsearch | "at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1346) ~[?:?]",
testelk01_elasticsearch | "at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:503) ~[?:?]",
testelk01_elasticsearch | "at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:488) ~[?:?]",
testelk01_elasticsearch | "at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:985) ~[?:?]",
testelk01_elasticsearch | "at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:247) ~[?:?]",
testelk01_elasticsearch | "at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:344) ~[?:?]",
testelk01_elasticsearch | "at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[?:?]",
testelk01_elasticsearch | "at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) ~[?:?]",
testelk01_elasticsearch | "at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) ~[?:?]",
testelk01_elasticsearch | "at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) ~[?:?]",
testelk01_elasticsearch | "at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]",
testelk01_elasticsearch | "at java.lang.Thread.run(Thread.java:830) [?:?]"] }
testelk01_elasticsearch | BindTransportException[Failed to bind to [9300]]; nested: BindException[Cannot assign requested address];
testelk01_elasticsearch | Likely root cause: java.net.BindException: Cannot assign requested address
testelk01_elasticsearch | 	at java.base/sun.nio.ch.Net.bind0(Native Method)
testelk01_elasticsearch | 	at java.base/sun.nio.ch.Net.bind(Net.java:469)
testelk01_elasticsearch | 	at java.base/sun.nio.ch.Net.bind(Net.java:458)
testelk01_elasticsearch | 	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:220)
testelk01_elasticsearch | 	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
testelk01_elasticsearch | 	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:551)
testelk01_elasticsearch | 	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1346)
testelk01_elasticsearch | 	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:503)
testelk01_elasticsearch | 	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:488)
testelk01_elasticsearch | 	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:985)
testelk01_elasticsearch | 	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:247)
testelk01_elasticsearch | 	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:344)
testelk01_elasticsearch | 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
testelk01_elasticsearch | 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
testelk01_elasticsearch | 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518)
testelk01_elasticsearch | 	at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050)
testelk01_elasticsearch | 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
testelk01_elasticsearch | 	at java.base/java.lang.Thread.run(Thread.java:830)
testelk01_elasticsearch | For complete error details, refer to the log at /usr/share/elasticsearch/logs/mycluster.log
testelk01_elasticsearch | {"type": "server", "timestamp": "2020-04-28T17:06:07,360Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "mycluster", "node.name": "testelk01.example.org", "message": "stopping ..." }
testelk01_elasticsearch | {"type": "server", "timestamp": "2020-04-28T17:06:07,367Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "mycluster", "node.name": "testelk01.example.org", "message": "stopped" }
testelk01_elasticsearch | {"type": "server", "timestamp": "2020-04-28T17:06:07,367Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "mycluster", "node.name": "testelk01.example.org", "message": "closing ..." }
testelk01_elasticsearch | {"type": "server", "timestamp": "2020-04-28T17:06:07,388Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "mycluster", "node.name": "testelk01.example.org", "message": "closed" }
testelk01_elasticsearch | {"type": "server", "timestamp": "2020-04-28T17:06:07,391Z", "level": "INFO", "component": "o.e.x.m.p.NativeController", "cluster.name": "mycluster", "node.name": "testelk01.example.org", "message": "Native controller process has stopped - no new native processes can be started" }
testelk01_elasticsearch exited with code {u'StatusCode': 1, u'Error': None}

Switching to network.host: 0.0.0.0 in elasticsearch.yml made everything work. My cluster was able to form.

That said, I'd really rather not use Docker's host mode. And if I have to, I'd really rather limit ES to listening on the specific network I want it to.

So any ideas on how I can do so?

Thanks. And thanks for all the replies so far.

DavidTurner · April 28, 2020, 6:21pm

Quoting the Docker docs on bridge networking:

Bridge networks apply to containers running on the same Docker daemon host. For communication among containers running on different Docker daemon hosts, you can either manage routing at the OS level, or you can use an overlay network.

The details of that are out of the scope of what I can help you with, but hopefully this points you in the right direction.

No, Elasticsearch works entirely using TCP (i.e. unicast) connections between nodes. It requires that every node can connect to the publish_address of every other node, which was the missing bit here.

Perhaps the most common approach I see is to arrange for the interfaces within each container have predictable names and then bind to a specific interface name using the syntax network.host: _eth0_.

jerrac · April 28, 2020, 7:16pm

Ok, so I'm a bit confused.

If I set up an httpd container and tell docker to use < internal ip >:9999:80 then I can go to http://< internal ip >:9999 in my browser and httpd thinks I'm coming from port 80.

Shouldn't the same apply to Elasticsearch? Whatever I set outside the container shouldn't matter to ES, as long as the data is sent to the port it's bound on.

With my config, that happens to be < internal ip >:9300-9400:9300-9400, so ES should be getting the data sent out over each hosts < internal ip >. It shouldn't matter that ES is listening on an internal docker ip.

Anyway, I'll dig into docker networking a bit more. See if I can figure this out.

Thanks.

DavidTurner · April 28, 2020, 8:23pm

No, what you describe for HTTPd is some kind of a proxy, and Elasticsearch doesn't support being proxied like that.

jerrac · April 28, 2020, 8:28pm

So, this is a proxy situation?

docker run -it --rm -p 127.0.0.1:9999:80 httpd

DavidTurner · April 28, 2020, 8:49pm

I think I've misunderstood something here, I was under the impression that network.publish_host had to be an address of a local interface, but that seems not to be true at all. So in fact I think you can proxy each node on a different address from the address to which it's bound.

Effectively, yes, httpd is binding to port 80 of some interface or another but it has no way to know what the rest of the world thinks its address is. Knowing your own address is not really important for httpd but it is important for Elasticsearch nodes since each node has to share this address with the rest of the cluster to allow them to connect back to it.

But anyway yes I think you can set network.publish_host: < internel ip > to override what Elasticsearch thinks its local address is. NB network.publish_host: 0.0.0.0 won't work since you need to tell Elasticsearch what specifically its address is from the point of view of the rest of the cluster.

jerrac · April 28, 2020, 9:03pm

Setting the publish_host to the ip of the nic I want to use didn't work.

At this point I need to move on. So I'm going to go the "old-fashioned" route and use the anisble role to configure things. Already have my hosts being configured by ansible for most everything else.

Docker was my initial try because our long term plan is move everything to containers. So I'll revisit when my docker knowledge is a bit better.

Thanks for all the help!

system · May 26, 2020, 9:03pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Cluster Changes to speed up master detection Elasticsearch docker	1	294	April 1, 2020
Enable x-pack on a Docker-Compose elasticsearch cluster Elasticsearch docker	4	8676	February 4, 2020
Elastic tools won't connect in simple docker-compose config Elasticsearch elastic-stack-security , docker	7	57	December 12, 2024
Elasticsearch cluster on Docker for production on different phisical servers Elasticsearch docker	1	402	March 27, 2019
Master not discovered or elected yet, an election requires 2 nodes with ids Elasticsearch elastic-stack-security , docker	2	676	May 26, 2021

How is Docker/Docker-Compose getting in the way?

Related topics