We are facing an issue with our Elasticsearch deployment: response times are slower on EKS than on Docker. In both cases we are using AWS-provisioned EC2 instances.
Docker consists of a single EC2 instance with a single data volume.
EKS consists of multiple EC2 instances with multiple data volumes.
We are using Gatling for load testing and we have around 3TB of data; data volumes are created in chunks of 500GB (7x data nodes = 3,500GB of total storage).
We confirmed that:
Changing the hardware specification does not affect response time (memory/CPU were increased to match the Docker host, and then beyond it).
Deploying on a single EKS node does not change response time.
We have performed network testing (transfer-rate checks) and asked AWS Support to investigate a potential load balancer bottleneck; no issues were found.
We have deployed into our EKS sandbox cluster, which has no restrictions such as network policies in place; no change there either.
We have moved our application from Docker to EKS one component at a time, and performance dropped only when the last component (Elasticsearch) was moved.
We attempted to tune kernel and OS settings such as vm.max_map_count, net.core.somaxconn, fs.file-max, vm.swappiness and ulimit (see the sketch after this list).
Previously we used the Bitnami Helm charts to deploy Elasticsearch, but eventually switched to ECK hoping to see improvements; we are getting the same results.
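For reference, this is roughly how we apply those kernel settings to the Elasticsearch pods via a privileged init container. This is only a sketch; the values other than vm.max_map_count are illustrative rather than the exact ones we tried.

# Slots into a nodeSet's podTemplate. vm.* and fs.* sysctls are node-wide;
# net.core.somaxconn is per network namespace, i.e. it only affects this pod.
podTemplate:
  spec:
    initContainers:
    - name: sysctl
      securityContext:
        privileged: true
        runAsUser: 0
      command:
      - sh
      - "-c"
      - |
        sysctl -w vm.max_map_count=262144
        sysctl -w net.core.somaxconn=65535   # illustrative value
        sysctl -w fs.file-max=2097152        # illustrative value
        sysctl -w vm.swappiness=1            # illustrative value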
Deployment:
3x master nodes, each with 16GB of memory, 4 CPUs, and sysctl -w vm.max_map_count=262144 applied
7x data nodes, each with 16GB of memory, 4 CPUs, and sysctl -w vm.max_map_count=262144 applied
Any questions, ideas, suggestions welcome.
Thanks!
We're using gp3 for both. We even increased the IOPS and throughput on the EKS cluster (12,000 IOPS and 600MB/s throughput, much greater than what we have on the Docker instance: 3,000 IOPS and 125MB/s). We've also tested using the same EC2 instance type (m5a.8xlarge). Unfortunately, no improvement in our EKS performance.
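For context, the data volumes come from a gp3 StorageClass along these lines with the EBS CSI driver. This is a sketch: the class name reuses the encrypted-gp3-retain name from our chart, and the parameter values mirror the IOPS/throughput figures above rather than being an exact copy of our definition.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-gp3-retain
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "12000"        # provisioned IOPS per volume
  throughput: "600"    # MiB/s per volume
  encrypted: "true"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true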
Both setups receive the same amount of load from the Gatling test.
We initially thought that if we increased the CPU/RAM of the Kubernetes Elasticsearch cluster (well beyond what we had in the Docker setup) performance would improve; it made no difference. We've even placed the Kubernetes Elasticsearch cluster on a single node to rule out the cluster spanning different AZs as the cause, but that made no difference either; both the Kubernetes and Docker containers used the same instance type, m5a.8xlarge.
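For reference, pinning the Elasticsearch pods to that instance type looks roughly like this in a nodeSet podTemplate (a sketch; the hostname in the comment is hypothetical and would only be used to force everything onto one specific node).

podTemplate:
  spec:
    nodeSelector:
      node.kubernetes.io/instance-type: m5a.8xlarge
      # kubernetes.io/hostname: ip-10-0-1-23.eu-west-1.compute.internal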
The only difference is that in our Kubernetes cluster, for availability-zone redundancy of the Elasticsearch cluster, we have 3 master nodes/pods, 3 data nodes/pods and 3 client nodes/pods, whereas in Docker we just have a single Elasticsearch container.
I think our next step is to collapse the master, data and client pods into a single (data) pod and see if performance improves. So maybe having multiple master, data and client nodes is causing the lower performance compared to the Docker container?
A few different things I see here that could be improved:
Given that you're using dedicated master nodes, you should ensure that traffic via the cluster service is only routed to data nodes. You can do this with a selector on the HTTP service (see http.service.spec.selector in the example config below).
Your dedicated master nodes seem very over-spec'd; you could probably get away with 2 CPUs and 4-8GB RAM for a cluster of this size.
7 data nodes don't split nicely across multiple availability zones for high availability. I'm not sure what your underlying AWS availability zone architecture looks like, but generally speaking you'd want to make sure your cluster can handle at least one AZ failure.
Try using gp3 with XFS rather than the default ext4. I've found moderate improvements using XFS, but it is also somewhat use-case specific (see the StorageClass sketch after the example config).
I see you have ml set as a node role on your data nodes. Given that you're using the basic license, you can probably remove this node role. If you do plan on adding ML to the cluster, it is generally recommended to use dedicated ML nodes.
You should generally ensure that Elasticsearch nodes of the same type are evenly distributed across AZs; this can be done via:
eck-elasticsearch:
  fullnameOverride: eck-elasticsearch
  version: 7.16.3
  annotations:
    eck.k8s.elastic.co/license: basic
    eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone" # node label used to determine the availability zone name
  http:
    service:
      spec:
        selector:
          elasticsearch.k8s.elastic.co/cluster-name: eck-elasticsearch
          elasticsearch.k8s.elastic.co/node-data: 'true' # Enable traffic routing via data nodes only
    tls:
      selfSignedCertificate:
        disabled: true
  updateStrategy:
    changeBudget:
      maxSurge: 3
      maxUnavailable: 1
  nodeSets:
  - name: masters
    count: 3
    # podDisruptionBudget:
    #   spec:
    #     minAvailable: 2
    #     selector:
    #       matchLabels:
    #         elasticsearch.k8s.elastic.co/cluster-name: quickstart
    config:
      node.roles: ["master"]
      # Enable ES zone awareness (node and zone) for even distribution of shards.
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
      node.attr.zone: $ZONE
      node.store.allow_mmap: false
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          # The annotation added to the Pod by the operator (via the
          # eck.k8s.elastic.co/downward-node-labels annotation above) is exposed as the ZONE env var.
          - name: ZONE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['topology.kubernetes.io/zone']
          resources:
            requests:
              cpu: 2
              memory: 4Gi
            limits:
              cpu: 2
              memory: 4Gi
        # Enable master nodes to be evenly balanced across hosts and AZs.
        topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              elasticsearch.k8s.elastic.co/cluster-name: eck-elasticsearch
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
        - labelSelector:
            matchLabels:
              elasticsearch.k8s.elastic.co/cluster-name: eck-elasticsearch
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
        initContainers:
        - command:
          - sh
          - "-c"
          - sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
            runAsUser: 0
        - command:
          - sh
          - "-c"
          - bin/elasticsearch-plugin install --batch mapper-annotated-text
          name: install-plugins
          securityContext:
            privileged: true
  - name: data
    count: 7
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data # Do not change this name unless you set up a volume mount for the data path.
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi
        storageClassName: encrypted-gp3-retain
    config:
      node.roles: ["data", "ingest", "transform"]
      # Enable ES zone awareness for even distribution of shards.
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
      node.attr.zone: $ZONE
      node.store.allow_mmap: false
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: ZONE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['topology.kubernetes.io/zone']
          resources:
            requests:
              cpu: 4
              memory: 16Gi
            limits:
              cpu: 4
              memory: 16Gi
        # Enable data nodes to be evenly balanced across hosts and AZs.
        topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              elasticsearch.k8s.elastic.co/cluster-name: eck-elasticsearch
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
        - labelSelector:
            matchLabels:
              elasticsearch.k8s.elastic.co/cluster-name: eck-elasticsearch
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
        initContainers:
        - command:
          - sh
          - "-c"
          - sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
            runAsUser: 0
        - command:
          - sh
          - "-c"
          - bin/elasticsearch-plugin install --batch mapper-annotated-text
          name: install-plugins
          securityContext:
            privileged: true
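On the gp3 + XFS point above, a StorageClass for that could look like the sketch below with the EBS CSI driver; the name, encryption and reclaim policy are placeholders, and the relevant part is the fstype parameter.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-xfs-retain   # hypothetical name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  csi.storage.k8s.io/fstype: xfs   # format volumes as XFS instead of the default ext4
  encrypted: "true"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer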
Just FYI, I've taken the feedback on board and adjusted our Helm chart for shard balancing. I've currently loaded some partial data sets and the cluster seems to be performing well so far (Gatling results show 2-3s response times for GET requests). However, this could be because fewer data sets are loaded. I will load the full data sets and let you know how the cluster performs with all the data added.