Slower Performance with Elasticsearch cluster in Kubernetes compared to Docker

A few different things I see here that can be improved:

  1. Given that you're using dedicated master nodes, you should ensure that traffic via the cluster service is routed only to the data nodes. You can do this by adding a selector to the HTTP service spec:
```yaml
spec:
  http:
    service:
      spec:
        selector:
          elasticsearch.k8s.elastic.co/cluster-name: <cluster-name>
          elasticsearch.k8s.elastic.co/node-data: 'true'
```
  2. Look at setting availability zone awareness on your nodes. This will help ensure that Elasticsearch distributes shards evenly across zones. For example:
```yaml
spec:
  nodeSets:
    - name: masters
      config:
        cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
        node.attr.zone: ${ZONE}
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              env:
                - name: ZONE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['topology.kubernetes.io/zone']
          initContainers:
            - name: elastic-internal-init-keystore
              env:
                - name: ZONE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['topology.kubernetes.io/zone']
    - name: data
      config:
        cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
        node.attr.zone: ${ZONE}
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              env:
                - name: ZONE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['topology.kubernetes.io/zone']
          initContainers:
            - name: elastic-internal-init-keystore
              env:
                - name: ZONE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['topology.kubernetes.io/zone']
```

Some additional notes:

  1. Your dedicated master nodes seem very over-spec'd; you could probably get away with 2 CPU and 4-8 GB RAM for a cluster of this size (see the resource sketch after this list).
  2. 7 data nodes don't "split" nicely across multiple availability zones for high availability; for example, 6 or 9 nodes divide evenly across three AZs. I'm not sure what your underlying AWS availability zone architecture looks like, but generally speaking, you'd want to make sure your cluster can handle at least one AZ failure.
  3. Try using gp3 volumes with XFS rather than the default ext4. I've found moderate improvements with XFS, but it is also somewhat use-case specific (a StorageClass sketch follows this list).
  4. I see you have ml set as a node role on your data nodes. Given that you're using the basic license, you can probably remove this node role (an example node.roles config follows this list). Also, if you do plan on adding ML to the cluster, it is generally recommended to use dedicated ML nodes.
  5. You should generally ensure that Elasticsearch nodes of the same type are evenly distributed across AZs. This can be done via:
```yaml
spec:
  nodeSets:
    - name: masters
      podTemplate:
        spec:
          topologySpreadConstraints:
            - labelSelector:
                matchLabels:
                  elasticsearch.k8s.elastic.co/cluster-name: <cluster-name>
              maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule
            - labelSelector:
                matchLabels:
                  elasticsearch.k8s.elastic.co/cluster-name: <cluster-name>
              maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
    - name: data
      podTemplate:
        spec:
          topologySpreadConstraints:
            - labelSelector:
                matchLabels:
                  elasticsearch.k8s.elastic.co/cluster-name: <cluster-name>
              maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule
            - labelSelector:
                matchLabels:
                  elasticsearch.k8s.elastic.co/cluster-name: <cluster-name>
              maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
```
  6. While technically not required, I would generally recommend setting a persistent volume on the master nodes (a volume claim sketch follows below).
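
For the master sizing point, here's a minimal sketch of what that could look like in the ECK manifest; the nodeSet name and the exact requests/limits are assumptions to adjust for your environment:

```yaml
spec:
  nodeSets:
    - name: masters          # assumed nodeSet name
      count: 3
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                # Placeholder values in the 2 CPU / 4-8 GB range mentioned above.
                requests:
                  cpu: 2
                  memory: 4Gi
                limits:
                  cpu: 2
                  memory: 4Gi
```

On recent Elasticsearch versions the JVM heap is sized automatically from the container memory limit, so you generally don't need to set `ES_JAVA_OPTS` by hand.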
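
For the gp3 + XFS suggestion, a sketch of a StorageClass, assuming the AWS EBS CSI driver is installed; the name is illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-xfs               # illustrative name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  csi.storage.k8s.io/fstype: xfs
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Reference it from the volumeClaimTemplates in your Elasticsearch spec via `storageClassName`.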
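
Dropping the ml role from the data nodes would look something like this; the exact role list is an assumption, so keep whichever roles you actually need:

```yaml
spec:
  nodeSets:
    - name: data
      config:
        # ml removed; keep only the roles this tier actually needs.
        node.roles: ["data", "ingest"]
```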
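
And for the persistent volume on the masters, a minimal sketch; `elasticsearch-data` is the claim name ECK expects, while the size and storage class are placeholders:

```yaml
spec:
  nodeSets:
    - name: masters
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data   # claim name expected by ECK
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 20Gi          # placeholder size
            storageClassName: gp3-xfs  # e.g. the StorageClass sketched above
```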