High network usage by master node

Hello,

We have self-hosted Elasticsearch, used primarily for logs. Cluster is made of 3 nodes + one is only fleet-server.

We have encountered very high network usage: download is now over 1 GiB/s and upload is nearing that number. We thought it might be communication between the nodes, but the numbers don't add up. I'm including a screenshot from our Grafana (we know we have a problem with disk usage and are actively working on that; also ignore the time title, those are numbers from the last 5 minutes).
I think it might be due to having 170 Elastic Agents deployed. But if that's the case, we might have to change the technology we use for monitoring. Right now only the basic System and Elastic Agent integrations are used, with 4 agents using the Prometheus integration.

Can you share the elasticsearch.yml of all your nodes just to confirm the roles?

It looks like you have 3 nodes, but just one is master eligible?

Also, Elastic Agent integrations use ingest pipelines, and from what you shared you have just one node working as an ingest node, so all the data sent by the agents will be processed by that node.
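If you want to confirm the role layout directly, the _cat/nodes API shows it per node (the CA path and the elastic user here are assumptions, adjust to your setup):

    curl -s -u elastic --cacert ca.crt "https://localhost:9200/_cat/nodes?v&h=name,node.role,master"

The node.role column lists the role letters (m = master, d = data, i = ingest, ...) and the master column marks the elected master with *.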

2 Likes

Hi,

That is correct, the master node has the most HW capabilities. We are running Elastic in Docker; the relevant parts of our docker-compose are included below.

es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    restart: unless-stopped
    labels:
      co.elastic.logs/module: elasticsearch
    volumes:
      - eslogs:/usr/share/elasticsearch/logs
      - /docker/elastic_compose/certs:/usr/share/elasticsearch/config/certs
      - esdata01:/usr/share/elasticsearch/data
      - ./letsencrypt-copy/elastic:/usr/share/elasticsearch/config/letsencrypt:ro
      - /elastic:/elastic
    ports:
      - ${ES_PORT}:9200
      - 9300:9300
    environment:
      - node.name=es01
      - cluster.name=${CLUSTER_NAME}
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01
      - network.publish_host=my-publish-host:8080
      - path.repo=/elastic
      - path.logs=/usr/share/elasticsearch/logs
      - node.roles=master,data,remote_cluster_client,ingest,transform
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - bootstrap.memory_lock=true
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=true
      - xpack.security.http.ssl.key=my-key.pem
      - xpack.security.http.ssl.certificate=fullchain-key.pem
      - xpack.security.http.ssl.certificate_authorities=certs-bundle.pem
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.key=certs/my-key.key
      - xpack.security.transport.ssl.certificate=certs/my-cert.crt
      - xpack.security.transport.ssl.certificate_authorities=certs/ca/ca.crt
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.license.self_generated.type=${LICENSE}
    mem_limit: ${ES_MEM_LIMIT}
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s --cacert config/certs/ca/ca-bundle.pem https://localhost:9200 | grep -q 'missing authentication credentials'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120
es02:
    image: elastic/elasticsearch:${STACK_VERSION}
    restart: unless-stopped
    volumes:
      - /docker2/docker/certs:/usr/share/elasticsearch/config/certs:ro
      - esdata02:/usr/share/elasticsearch/data
      - /elastic:/elastic
    ports:
      - "${ES_PORT}:9200"
      - "9300:9300"
    environment:
      - node.name=es02
      - cluster.name=${CLUSTER_NAME}
      - discovery.seed_hosts=es01,es03
      - network.publish_host=my-host:8080
      - path.repo=/elastic
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - node.roles=data
      - bootstrap.memory_lock=true
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=true
      - xpack.security.http.ssl.key=/usr/share/elasticsearch/config/certs/es02/es02.key
      - xpack.security.http.ssl.certificate=/usr/share/elasticsearch/config/certs/es02/es02.crt
      - xpack.security.http.ssl.certificate_authorities=/usr/share/elasticsearch/config/certs/ca/ca-bundle.pem
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.key=/usr/share/elasticsearch/config/certs/es02/es02.key
      - xpack.security.transport.ssl.certificate=/usr/share/elasticsearch/config/certs/es02/es02.crt
      - xpack.security.transport.ssl.certificate_authorities=/usr/share/elasticsearch/config/certs/ca/ca.crt
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.license.self_generated.type=${LICENSE}
    mem_limit: 20g
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s --cacert /usr/share/elasticsearch/config/certs/ca/ca-bundle.pem https://localhost:9200 | grep -q 'missing authentication credentials'"
        ]
      interval: 10s
      timeout: 5s
      retries: 60

es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    restart: unless-stopped
    labels:
      co.elastic.logs/module: elasticsearch
    volumes:
      - esdata03:/usr/share/elasticsearch/data
      - config03:/usr/share/elasticsearch/config
      - /docker2/docker/config/certs:/usr/share/elasticsearch/config/certs:ro
      - /elastic:/elastic
    ports:
      - ${ES_PORT}:9200
      - 9300:9300
    environment:
      - node.name=es03
      - cluster.name=${CLUSTER_NAME}
      - discovery.seed_hosts=es01,es02
      - network.publish_host=my-publish-host:8080
      - path.repo=/elastic
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - node.roles=data
      - bootstrap.memory_lock=true
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=true
      - xpack.security.http.ssl.key=/usr/share/elasticsearch/config/certs/es01/es01.key
      - xpack.security.http.ssl.certificate=/usr/share/elasticsearch/config/certs/es01/es01.crt
      - xpack.security.http.ssl.certificate_authorities=/usr/share/elasticsearch/config/certs/ca/ca-bundle.pem
      - xpack.security.transport.ssl.enabled=true
      - xpack.security.transport.ssl.key=/usr/share/elasticsearch/config/certs/es01/es01.key
      - xpack.security.transport.ssl.certificate=/usr/share/elasticsearch/config/certs/es01/es01.crt
      - xpack.security.transport.ssl.certificate_authorities=/usr/share/elasticsearch/config/certs/ca/ca.crt
      - xpack.security.transport.ssl.verification_mode=certificate
      - xpack.license.self_generated.type=${LICENSE}
      - ES_JAVA_OPTS=-Xms12g -Xmx12g
    mem_limit: 20g
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s --cacert /usr/share/elasticsearch/config/certs/ca/ca-bundle.pem https://localhost:9200 | grep -q 'missing authentication credentials'"
        ]
      interval: 10s
      timeout: 10s
      retries: 120

Hey @rara01,

Some thoughts.

es01 => node.roles=master,data,remote_cluster_client,ingest,transform
es02 => node.roles=data
es03 => node.roles=data

Just like @leandrojmp is saying.

There is only one master, es01, which also has a bunch of different roles.

  • master (cluster coordination)
  • data (indexing & search)
  • ingest (pipeline pre-processing)
  • remote_cluster_client
  • transform

This is likely the main culprit for the high network traffic. The only master node is doing master, data and ingest work. It’s receiving traffic from agents, Fleet server, Kibana, and local monitoring. It’s also doing coordination + cluster state updates and handling actual user data and transforms. It may also be logging or monitoring itself heavily.
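If you do end up spreading the load, one possible direction (just a sketch; the exact role mix is your call, and changing master eligibility on a running cluster should be done carefully) would be to make es02 and es03 master eligible and ingest capable too, so coordination and pipeline work is no longer pinned to es01:

      # hypothetical environment change for es02 / es03
      - node.roles=master,data,ingest

With three master-eligible nodes the cluster can also survive losing one of them, which it cannot today.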

Are these Docker containers running on dedicated hosts?

Also I noticed that node es03 has the following config:

      - xpack.security.http.ssl.key=/usr/share/elasticsearch/config/certs/es01/es01.key
      - xpack.security.http.ssl.certificate=/usr/share/elasticsearch/config/certs/es01/es01.crt
      - xpack.security.http.ssl.certificate_authorities=/usr/share/elasticsearch/config/certs/ca/ca-bundle.pem

It looks like es03 is using es01's private key and certificate. That is a security anti-pattern and could cause TLS-level conflicts, identity mismatches, and even routing issues. It may be contributing to network confusion or repeated SSL handshakes.
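If it helps, a per-node certificate can be generated from the same CA with elasticsearch-certutil (the CA file name and the es03 DNS name here are assumptions, adjust to your setup):

    bin/elasticsearch-certutil cert \
      --ca elastic-stack-ca.p12 \
      --name es03 --dns es03 \
      --pem --out es03-certs.zip

The --pem flag gives you a zip with the .crt and .key to mount for es03 instead of reusing the es01 files.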

Best regards,

Willem

Good spots - @rara01, there is very little symmetry here ... even between es02 and es03?

es03 has a mount config03:/usr/share/elasticsearch/config, none of the other nodes have this.

es03 has ES_JAVA_OPTS=-Xms12g -Xmx12g, neither of the other 2 nodes has this setting. Which is, btw, more than 50% of the 20g memory limit.

es02's retries and timeout don't match the other 2 nodes.
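On the ES_JAVA_OPTS point: the general guidance is to keep the heap at or below 50% of the memory available to the node (or just let Elasticsearch size it automatically, as es01 and es02 apparently do). If you do set it explicitly, something like this on a 20g node (10g is only an illustrative figure) would stay within that guidance:

      - ES_JAVA_OPTS=-Xms10g -Xmx10g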

Er ... ? Whose key is the letsencrypt one?

$ fgrep xpack.security.http.ssl.key es??
es01:      - xpack.security.http.ssl.key=letsencrypt/privkey.pem
es02:      - xpack.security.http.ssl.key=/usr/share/elasticsearch/config/certs/es02/es02.key
es03:      - xpack.security.http.ssl.key=/usr/share/elasticsearch/config/certs/es01/es01.key

And es01 has cluster.initial_master_nodes as well as discovery.seed_hosts; the cluster.initial_master_nodes setting is probably unwise at this point.
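For reference, cluster.initial_master_nodes is only meant for bootstrapping a brand-new cluster; once the cluster has formed, the docs say to remove it. So es01 would keep only the seed hosts, e.g.:

      - discovery.seed_hosts=es02,es03
      # cluster.initial_master_nodes removed after the first successful bootstrap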

EDIT: Forgot to add, you can use tools like ntopng (surely there are others within the Docker ecosystem) to get a breakdown of traffic flows, which can sometimes be helpful.
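Elasticsearch itself also keeps per-node counters that help split inter-node (transport) traffic from client (HTTP) traffic; something along these lines (credentials and CA path are assumptions):

    curl -s -u elastic --cacert ca.crt \
      "https://localhost:9200/_nodes/stats/transport,http?filter_path=nodes.*.name,nodes.*.transport.rx_size_in_bytes,nodes.*.transport.tx_size_in_bytes,nodes.*.http.total_opened"

If the transport rx/tx bytes on es01 are far below the 1 GiB/s you see on the NIC, the bulk of the traffic is coming from the agents over HTTP rather than from node-to-node replication.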

1 Like

The Docker containers are running on dedicated VMs, but on the same server. So we are starting to worry about our network card limitations.
So you think the high traffic may come from the Elastic Agent ingest operations? Maybe we could put the ingest node on a different server?
Regarding certificates, we will take a look into that. We set up Elastic following the "Getting started with ELK on Docker Compose" guide, so those certificates come from that. We will change that.

Thank you for those catches. We had limitations everywhere, but after some time, when we were raising memory for all the nodes, Elastic wasn't showing the added memory even when this variable was changed. In the process of debugging, we left it there.
Regarding letsencrypt, only the first node has a publicly available URL, and that one uses the Let's Encrypt certificate. What I thought is that the data nodes connect to the master node over HTTP as well as transport, and I just left those certificates in the transport and http xpack settings.

es03 has this volume because, weirdly, when setting up this node I hit some errors that I didn't encounter while setting up the rest of the cluster. This was one way of debugging, again.

But thank you for all those catches, we will take a look into that.

Maybe I'm too old school or misinterpreted this, but ...

One server, which hosts 3x ES instances, which are Docker containers running within 3x VMs, only one of which has a "publicly available URL" (hopefully you don't really mean "publicly" available!), which is the only master and only ingest node, maybe all sharing one network card (and who knows what else, as I'm not sure what is really "dedicated" here), servicing (currently) 170 agents, with a somewhat untidy ES setup.

That does not strike me as optimal. I'd strongly recommend you act on @leandrojmp 's points in his 1st reply. Particularly if the 170 is going to increase significantly, or you will integrate more capability.

The servers do have enough capacity to support those nodes in terms of computing power. But if the ingest role is on the master, how would that translate into network usage? If one node is downloading data and transforming it with pipelines, then it should save it, right? I understand that there is usage from uploading data to the other nodes, but as I mentioned, the usage is higher than the download/upload of data from the master to the data nodes.

The setup is untidy, and I will edit it as mentioned. But what's wrong with the rest, and what do you mean by "publicly"? We have a public URL for Elasticsearch and Kibana, so the database and UI are available outside our organisation, as we have customers sending data to the database, and in the future we plan on making dashboards for them.

(this post ended up too long, sorry!)

OK, this is a side point, let's take it first. If I can (in any sense really) access your Elasticsearch/Kibana on ports 9200/5601, that would be too "publicly" available IMHO. If there are public DNS names/IP addresses, like AWS can assign an EC2 instance a public address, plus security groups and all that jazz that limit access to specific points from specific addresses, or variants thereof with CloudFront or 101 other methods, it's still "public" but not in the same sense. I presume now that's what you meant by "publicly" originally. That's all I was getting at.

If you meant that es01 is the only one that the 170 elastic agents can reach, then ... see next point.

As written, IMO that's not correct, but the specific wording is not completely clear. I asked ChatGPT this (hoping for an unambiguously worded reply):

"I have a cluster with 3 nodes. node1 is master, ingest, transfor and data node
node2 is a dedicated data node. node3 is a dedicated data node

Can you describe how a documents sent to say node3, where an ingest pipeline will be involved, will traverse the cluster."

It replied (edited):

"Even if the document is sent to node3, it will be routed to node1 for pipeline processing, then routed back to any data node (including node3) for storage."

That's my understanding too. This is effectively re-wording this point:

Given every piece of agent data in your cluster goes through an ingest pipeline, and every ingest goes via the master node, which btw also does all the other transforms, maintains cluster state, and holds its own share of the data, while some of the data (maybe 2/3rds depending on the index setup) ends up on the other 2 nodes, I don't think it's a priori surprising that the master node has a significant share of the network traffic.

Does it explain the actual numbers you see? I don't know, but AFAIK there's nothing written on the thread so far that suggests it would not.
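One way to see how much pipeline work actually lands on es01 is the per-node ingest counters (again assuming the elastic user and your CA file):

    curl -s -u elastic --cacert ca.crt \
      "https://localhost:9200/_nodes/stats/ingest?filter_path=nodes.*.name,nodes.*.ingest.total"

Since es01 is the only ingest node, you should see all the pipeline executions accumulating there, which would match the picture above.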

es01/2/3 are on the same physical host, right? My "old-school" point was that, with the various layers in your setup (hardware/OS/VMs/Docker/...), you end up with a lot of arguably unnecessary overhead, like network traffic and other things: 3x JVMs, 3x Java heaps, 3x (probably more in fact) filesystem caches, ... Had you a 1-node cluster, doing everything, running way closer to the hardware, you would have less of these overheads and less complexity. And, perhaps, exactly the same problem/confusion. I freely admit this is not the modern way, nor even a common way.

2 Likes