Elasticsearch on AKS

krzysztoftorunski · January 21, 2019, 8:42am

Hi,

I'm testing an Elasticsearch cluster in an Azure AKS environment.
I'm using a 5-node AKS cluster (Standard E8s v3). Three of the nodes are dedicated to the Elasticsearch cluster, and all three are data + master nodes. All pods have an additional persistent volume (Premium SSD, 1 TB) for data.
We are getting lots of log entries like:
[INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-node-1] [gc][6799] overhead
[WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-node-1] [gc][6799] overhead

From time to time we are reaching the queue size limit (200).

I used Rally to benchmark the Elasticsearch cluster:
esrally --pipeline=benchmark-only --target-hosts=elasticsearch:9200 --track=geopoint --challenge=append-fast-with-conflicts

Lap	Metric	Task	Value	Unit
All	Total Young Gen GC		2245.89	s
All	Total Old Gen GC		0.452	s
All	Min Throughput	index-update	21163.6	docs/s
All	Median Throughput	index-update	21969.2	docs/s
All	Max Throughput	index-update	26891.7	docs/s

I created a single VM (E8s v3) on Azure to test whether this might be an issue with the sizing of the VM,
but I see a huge difference in Max Throughput between the three-node cluster on AKS and the single VM.

Lap	Metric	Task	Value	Unit
All	Total Young Gen GC		78.33	s
All	Total Old Gen GC		0.217	s
All	Min Throughput	index-update	94740.5	docs/s
All	Median Throughput	index-update	106737	docs/s
All	Max Throughput	index-update	126790	docs/s

I also tested a one-node Elasticsearch cluster on AKS.

| Lap | Metric | Task | Value | Unit |
|---|---|---|---|---|
| All | Total Young Gen GC | | 3424.56 | s |
| All | Total Old Gen GC | | 0.428 | s |
| All | Min Throughput | index-update | 6633.7 | docs/s |
| All | Median Throughput | index-update | 7144.32 | docs/s |
| All | Max Throughput | index-update | 7473.19 | docs/s |
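
For a quick side-by-side of the three runs, the ratios can be worked out with a small sketch. The numbers are copied from the Rally summaries above; the dictionary layout and the `compare` helper are only illustrative and are not part of Rally.

```python
# Median index-update throughput (docs/s) and total young-gen GC time (s),
# copied from the three Rally summaries in this post.
runs = {
    "AKS, 3 nodes": {"median_docs_per_s": 21969.2, "young_gc_s": 2245.89},
    "single VM": {"median_docs_per_s": 106737.0, "young_gc_s": 78.33},
    "AKS, 1 node": {"median_docs_per_s": 7144.32, "young_gc_s": 3424.56},
}

# The single VM is used as the baseline for both ratios.
baseline = runs["single VM"]


def compare(run):
    """Return (throughput ratio, GC ratio) of a run relative to the baseline.

    The first value says how many times higher the baseline throughput is,
    the second how many times more young-gen GC time the run accumulated.
    """
    throughput_ratio = baseline["median_docs_per_s"] / run["median_docs_per_s"]
    gc_ratio = run["young_gc_s"] / baseline["young_gc_s"]
    return throughput_ratio, gc_ratio


for name, run in runs.items():
    throughput_ratio, gc_ratio = compare(run)
    print(f"{name:13s} VM throughput is {throughput_ratio:5.1f}x, "
          f"young GC time is {gc_ratio:5.1f}x the VM's")
```

With these figures, median throughput on the single VM comes out roughly 4.9 times higher than on the three-node AKS cluster and roughly 14.9 times higher than on the one-node AKS cluster, while young-gen GC time on AKS is roughly 29 and 44 times the VM's.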

Do you have any experience with Azure AKS?
Maybe I should set up Elasticsearch differently than on a VM.

Krzysiek

krzysztoftorunski · January 21, 2019, 8:44am

AKS 3 nodes cluster

Lap	Metric	Task	Value	Unit
All	Total indexing time		67.0337	min
All	Min indexing time per shard		0	min
All	Median indexing time per shard		0	min
All	Max indexing time per shard		12.9383	min
All	Total merge time		17.6462	min
All	Min merge time per shard		0	min
All	Median merge time per shard		0	min
All	Max merge time per shard		4.2327	min
All	Total merge throttle time		1.34357	min
All	Min merge throttle time per shard		0	min
All	Median merge throttle time per shard		0	min
All	Max merge throttle time per shard		0.249117	min
All	Total refresh time		15.7534	min
All	Min refresh time per shard		0	min
All	Median refresh time per shard		0	min
All	Max refresh time per shard		3.03417	min
All	Total flush time		0.0199333	min
All	Min flush time per shard		0	min
All	Median flush time per shard		0	min
All	Max flush time per shard		0.01645	min
All	Total Young Gen GC		2245.89	s
All	Total Old Gen GC		0.452	s
All	Store size		63.3685	GB
All	Translog size		4.93038	GB
All	Heap used for segments		47.8294	MB
All	Heap used for doc values		4.14286	MB
All	Heap used for terms		25.2056	MB
All	Heap used for norms		0.0022583	MB
All	Heap used for points		12.9216	MB
All	Heap used for stored fields		5.557	MB
All	Segment count		339
All	Min Throughput	index-update	21163.6	docs/s
All	Median Throughput	index-update	21969.2	docs/s
All	Max Throughput	index-update	26891.7	docs/s
All	50th percentile latency	index-update	1800.68	ms
All	90th percentile latency	index-update	2468.89	ms
All	99th percentile latency	index-update	3369.75	ms
All	99.9th percentile latency	index-update	4504.29	ms
All	99.99th percentile latency	index-update	5358.83	ms
All	100th percentile latency	index-update	5416.31	ms
All	50th percentile service time	index-update	1800.68	ms
All	90th percentile service time	index-update	2468.89	ms
All	99th percentile service time	index-update	3369.75	ms
All	99.9th percentile service time	index-update	4504.29	ms
All	99.99th percentile service time	index-update	5358.83	ms
All	100th percentile service time	index-update	5416.31	ms
All	error rate	index-update	0	%

krzysztoftorunski · January 21, 2019, 8:44am

One VM

Lap	Metric	Task	Value	Unit
All	Total indexing time		96.1971	min
All	Min indexing time per shard		0.000666667	min
All	Median indexing time per shard		4.93302	min
All	Max indexing time per shard		12.4609	min
All	Total merge time		5.79252	min
All	Min merge time per shard		0	min
All	Median merge time per shard		0.253533	min
All	Max merge time per shard		1.04133	min
All	Total merge throttle time		0.154983	min
All	Min merge throttle time per shard		0	min
All	Median merge throttle time per shard		0.00508333	min
All	Max merge throttle time per shard		0.05465	min
All	Total refresh time		8.19662	min
All	Min refresh time per shard		0.00113333	min
All	Median refresh time per shard		0.54775	min
All	Max refresh time per shard		0.943633	min
All	Total flush time		0.000133333	min
All	Min flush time per shard		0	min
All	Median flush time per shard		0	min
All	Max flush time per shard		0.000133333	min
All	Total Young Gen GC		78.33	s
All	Total Old Gen GC		0.217	s
All	Store size		6.32997	GB
All	Translog size		6.31995	GB
All	Heap used for segments		30.7482	MB
All	Heap used for doc values		0.0181198	MB
All	Heap used for terms		28.1291	MB
All	Heap used for norms		0.0639038	MB
All	Heap used for points		0.964798	MB
All	Heap used for stored fields		1.57222	MB
All	Segment count		192
All	Min Throughput	index-update	94740.5	docs/s
All	Median Throughput	index-update	106737	docs/s
All	Max Throughput	index-update	126790	docs/s
All	50th percentile latency	index-update	315.807	ms
All	90th percentile latency	index-update	755.748	ms
All	99th percentile latency	index-update	2455.04	ms
All	99.9th percentile latency	index-update	5273.72	ms
All	100th percentile latency	index-update	5694.4	ms
All	50th percentile service time	index-update	315.807	ms
All	90th percentile service time	index-update	755.748	ms
All	99th percentile service time	index-update	2455.04	ms
All	99.9th percentile service time	index-update	5273.72	ms
All	100th percentile service time	index-update	5694.4	ms
All	error rate	index-update	0	%

krzysztoftorunski · January 21, 2019, 8:55am

AKS one node cluster

| Lap | Metric | Task | Value | Unit |
|---|---|---|---|---|
| All | Total indexing time | | 87.0735 | min |
| All | Min indexing time per shard | | 0 | min |
| All | Median indexing time per shard | | 5.83333e-05 | min |
| All | Max indexing time per shard | | 16.0536 | min |
| All | Total merge time | | 52.8072 | min |
| All | Min merge time per shard | | 0 | min |
| All | Median merge time per shard | | 0 | min |
| All | Max merge time per shard | | 9.01507 | min |
| All | Total merge throttle time | | 0.40045 | min |
| All | Min merge throttle time per shard | | 0 | min |
| All | Median merge throttle time per shard | | 0 | min |
| All | Max merge throttle time per shard | | 0.0894833 | min |
| All | Total refresh time | | 19.3158 | min |
| All | Min refresh time per shard | | 0 | min |
| All | Median refresh time per shard | | 0.000133333 | min |
| All | Max refresh time per shard | | 3.26023 | min |
| All | Total flush time | | 0.00588333 | min |
| All | Min flush time per shard | | 0 | min |
| All | Median flush time per shard | | 0 | min |
| All | Max flush time per shard | | 0.00305 | min |
| All | Total Young Gen GC | | 3424.56 | s |
| All | Total Old Gen GC | | 0.428 | s |
| All | Store size | | 25.1363 | GB |
| All | Translog size | | 4.06481 | GB |
| All | Heap used for segments | | 40.8529 | MB |
| All | Heap used for doc values | | 3.37786 | MB |
| All | Heap used for terms | | 23.4827 | MB |
| All | Heap used for norms | | 0.000305176 | MB |
| All | Heap used for points | | 9.73372 | MB |
| All | Heap used for stored fields | | 4.25829 | MB |
| All | Segment count | | 295 | |
| All | Min Throughput | index-update | 6633.7 | docs/s |
| All | Median Throughput | index-update | 7144.32 | docs/s |
| All | Max Throughput | index-update | 7473.19 | docs/s |
| All | 50th percentile latency | index-update | 5206.57 | ms |
| All | 90th percentile latency | index-update | 7434.04 | ms |
| All | 99th percentile latency | index-update | 15225.5 | ms |
| All | 99.9th percentile latency | index-update | 30710.5 | ms |
| All | 99.99th percentile latency | index-update | 41678.9 | ms |
| All | 100th percentile latency | index-update | 41977.5 | ms |
| All | 50th percentile service time | index-update | 5206.57 | ms |
| All | 90th percentile service time | index-update | 7434.04 | ms |
| All | 99th percentile service time | index-update | 15225.5 | ms |
| All | 99.9th percentile service time | index-update | 30710.5 | ms |
| All | 99.99th percentile service time | index-update | 41678.9 | ms |
| All | 100th percentile service time | index-update | 41977.5 | ms |
| All | error rate | index-update | 0 | % |

Christian_Dahlqvist · January 21, 2019, 9:05am

What is the full output of the cluster stats API?

krzysztoftorunski · January 21, 2019, 9:16am

{
"_nodes" : {
"total" : 3,
"successful" : 3,
"failed" : 0
},
"cluster_name" : "DDL",
"cluster_uuid" : "Z6I3dHv0TPSkOBtcdHFQHw",
"timestamp" : 1548062105067,
"status" : "green",
"indices" : {
"count" : 219,
"shards" : {
"total" : 242,
"primaries" : 226,
"replication" : 0.07079646017699115,
"index" : {
"shards" : {
"min" : 1,
"max" : 6,
"avg" : 1.1050228310502284
},
"primaries" : {
"min" : 1,
"max" : 6,
"avg" : 1.0319634703196348
},
"replication" : {
"min" : 0.0,
"max" : 1.0,
"avg" : 0.0730593607305936
}
}
},
"docs" : {
"count" : 61318280,
"deleted" : 7633014
},
"store" : {
"size" : "15.2gb",
"size_in_bytes" : 16333773175
},
"fielddata" : {
"memory_size" : "121.9kb",
"memory_size_in_bytes" : 124832,
"evictions" : 0
},
"query_cache" : {
"memory_size" : "27.8mb",
"memory_size_in_bytes" : 29243840,
"total_count" : 1041873,
"hit_count" : 225631,
"miss_count" : 816242,
"cache_size" : 10894,
"cache_count" : 44847,
"evictions" : 33953
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 1623,
"memory" : "48.5mb",
"memory_in_bytes" : 50896554,
"terms_memory" : "31.8mb",
"terms_memory_in_bytes" : 33370510,
"stored_fields_memory" : "4.3mb",
"stored_fields_memory_in_bytes" : 4553352,
"term_vectors_memory" : "0b",
"term_vectors_memory_in_bytes" : 0,
"norms_memory" : "1mb",
"norms_memory_in_bytes" : 1086336,
"points_memory" : "5.2mb",
"points_memory_in_bytes" : 5463968,
"doc_values_memory" : "6.1mb",
"doc_values_memory_in_bytes" : 6422388,
"index_writer_memory" : "0b",
"index_writer_memory_in_bytes" : 0,
"version_map_memory" : "0b",
"version_map_memory_in_bytes" : 0,
"fixed_bit_set" : "2.3mb",
"fixed_bit_set_memory_in_bytes" : 2455960,
"max_unsafe_auto_id_timestamp" : 1548028810206,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 3,
"data" : 3,
"coordinating_only" : 0,
"master" : 3,
"ingest" : 3
},
"versions" : [
"6.5.4"
],
"os" : {
"available_processors" : 3,
"allocated_processors" : 3,
"names" : [
{
"name" : "Linux",
"count" : 3
}
],
"mem" : {
"total" : "188.7gb",
"total_in_bytes" : 202662752256,
"free" : "54.2gb",
"free_in_bytes" : 58225688576,
"used" : "134.5gb",
"used_in_bytes" : 144437063680,
"free_percent" : 29,
"used_percent" : 71
}
},
"process" : {
"cpu" : {
"percent" : 1
},
"open_file_descriptors" : {
"min" : 596,
"max" : 663,
"avg" : 633
}
},
"jvm" : {
"max_uptime" : "1.7d",
"max_uptime_in_millis" : 151748905,
"versions" : [
{
"version" : "11.0.1",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "11.0.1+13",
"vm_vendor" : "Oracle Corporation",
"count" : 3
}
],
"mem" : {
"heap_used" : "36.9gb",
"heap_used_in_bytes" : 39669175176,
"heap_max" : "96gb",
"heap_max_in_bytes" : 103079215104
},
"threads" : 137
},
"fs" : {
"total" : "2.9tb",
"total_in_bytes" : 3243205423104,
"free" : "2.9tb",
"free_in_bytes" : 3223063965696,
"available" : "2.9tb",
"available_in_bytes" : 3223013634048
},
"plugins" : [
{
"name" : "ingest-user-agent",
"version" : "6.5.4",
"elasticsearch_version" : "6.5.4",
"java_version" : "1.8",
"description" : "Ingest processor that extracts information from a user agent",
"classname" : "org.elasticsearch.ingest.useragent.IngestUserAgentPlugin",
"extended_plugins" : [ ],
"has_native_controller" : false
},
{
"name" : "ingest-geoip",
"version" : "6.5.4",
"elasticsearch_version" : "6.5.4",
"java_version" : "1.8",
"description" : "Ingest processor that uses looksup geo data based on ip adresses using the Maxmind geo database",
"classname" : "org.elasticsearch.ingest.geoip.IngestGeoIpPlugin",
"extended_plugins" : [ ],
"has_native_controller" : false
}
],
"network_types" : {
"transport_types" : {
"security4" : 3
},
"http_types" : {
"security4" : 3
}
}
}
}
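
One way to read these stats is to divide the cluster-wide totals by the node count. A minimal sketch, using only values copied from the output above (the trimmed excerpt and the variable names are illustrative):

```python
import json

# Trimmed excerpt of the cluster stats output above; only the fields
# needed for the per-node arithmetic are kept.
stats_excerpt = json.loads("""
{
  "nodes": {
    "count": {"total": 3},
    "os": {
      "available_processors": 3,
      "mem": {"total_in_bytes": 202662752256}
    },
    "jvm": {
      "mem": {"heap_max_in_bytes": 103079215104}
    }
  }
}
""")

GIB = 1024 ** 3

nodes = stats_excerpt["nodes"]
node_count = nodes["count"]["total"]

# Cluster-wide totals divided evenly across the nodes.
per_node = {
    "processors": nodes["os"]["available_processors"] / node_count,
    "ram_gib": nodes["os"]["mem"]["total_in_bytes"] / node_count / GIB,
    "heap_gib": nodes["jvm"]["mem"]["heap_max_in_bytes"] / node_count / GIB,
}

for key, value in per_node.items():
    print(f"{key}: {value:.1f}")
```

This works out to one available processor and a 32 GiB heap per node, which stands out on VMs that have 8 vCPUs each.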

krzysztoftorunski · January 21, 2019, 10:52pm

I added resource requests to the YAML file:

      resources:
        requests:
          memory: 40Gi
          cpu: "7"

Without the resource requests, Elasticsearch was using only one CPU.

Now I see that Elasticsearch is using 21 CPUs across the three nodes:
"nodes" : {
"count" : {
"total" : 3,
"data" : 3,
"coordinating_only" : 0,
"master" : 3,
"ingest" : 3
},
"versions" : [
"6.5.4"
],
"os" : {
"available_processors" : 21,
"allocated_processors" : 21,
"names" : [
{
"name" : "Linux",
"count" : 3
}
],

Is it normal behavior that without resources -> requests -> cpu, Elasticsearch will use only one CPU?

DavidTurner · January 23, 2019, 10:37am

Which YAML file? I do not recognise these settings. Can you share the whole YAML file?

krzysztoftorunski · January 23, 2019, 12:36pm

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch-node
spec:
  selector:
    matchLabels:
      app: elasticsearch # has to match .spec.template.metadata.labels
  serviceName: "elasticsearch"
  replicas: 3
  template:
    metadata:
      labels:
        app: elasticsearch
        elastic-index-name: elasticsearch
    spec:
      serviceAccountName: elasticsearch-logging
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              podAffinityTerm:
                topologyKey: "kubernetes.io/hostname"
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - elasticsearch
      initContainers:
        - name: chmod-er
          image: acramslhgamsfnds.azurecr.io/busybox:1.27.2
          command: ["sh", "-c", "/bin/chown 1000:1000 /data"]
          volumeMounts:
            - name: data
              mountPath: /data
        - name: init-sysctl
          image: acramslhgamsfnds.azurecr.io/busybox:1.27.2
          command: ["sh", "-c", "sysctl -w vm.max_map_count=262144"]
          securityContext:
            privileged: true
      containers:
        - name: elasticsearch
          image: elasticsearch:6.5.4
          ports:
            - containerPort: 9200
            - containerPort: 9300
          resources:
            requests:
              memory: 40Gi
              cpu: "6"
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: discovery.zen.ping.unicast.hosts
              value: "elasticsearch"
            - name: cluster.name
              value: "DDL"
            - name: discovery.zen.minimum_master_nodes
              value: "2"
            - name: network.host
              value: "0.0.0.0"
            - name: ES_JAVA_OPTS
              value: "-Xms32766m -Xmx32766m -XX:-UseConcMarkSweepGC -XX:+UseG1GC -XX:MaxGCPauseMillis=300 -XX:G1HeapRegionSize=16m"
            - name: path.data
              value: "/data"
            - name: node.master
              value: "true"
            - name: node.data
              value: "true"
            - name: node.ingest
              value: "true"
            - name: node.name
              value: ${HOSTNAME}
            - name: gateway.recover_after_nodes
              value: "3"
          volumeMounts:
            - name: data
              mountPath: /data
      terminationGracePeriodSeconds: 300
      nodeSelector:
        nodefor: elasticsearch

DavidTurner · January 23, 2019, 1:33pm

It's kinda hard to read, since YAML cares about how things are indented and you didn't use the </> button so this vital detail is lost. However, by the looks of it this configures the containers and determines things like the number of CPUs to which they have access, and if you don't ask for more, then 1 seems like a reasonable default.
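
To make the arithmetic concrete, here is a rough sketch of how a CPU request could turn into the processor count that Elasticsearch reports. It rests on two assumptions that are not confirmed anywhere in this thread: that Kubernetes translates the CPU request into cgroup CPU shares (1024 per core, with a floor of 2 when nothing is requested), and that the container-aware JVM in use (11.0.1 here) derives its processor count from those shares. The function names are made up for illustration, CPU limits are ignored, and newer JDK releases may behave differently.

```python
import math

SHARES_PER_CPU = 1024  # assumed cgroup convention: 1024 shares per full core
MIN_SHARES = 2         # assumed floor applied when no CPU request is set


def request_to_shares(cpu_request_millicores: int) -> int:
    """Approximate how a CPU request maps to cgroup CPU shares."""
    if cpu_request_millicores == 0:
        return MIN_SHARES
    return max(MIN_SHARES, cpu_request_millicores * SHARES_PER_CPU // 1000)


def jvm_processors(shares: int, host_cpus: int) -> int:
    """Approximate the processor count a container-aware JVM would report."""
    if shares == SHARES_PER_CPU:
        # Assumption: the default share value is treated as "nothing configured".
        return host_cpus
    return min(host_cpus, math.ceil(shares / SHARES_PER_CPU))


HOST_CPUS = 8  # vCPUs per node, as mentioned in this thread
NODES = 3

no_request = jvm_processors(request_to_shares(0), HOST_CPUS)
with_request = jvm_processors(request_to_shares(7000), HOST_CPUS)  # cpu: "7"

print("no request:  ", no_request, "per node,", no_request * NODES, "in total")
print("request of 7:", with_request, "per node,", with_request * NODES, "in total")
```

Under these assumptions the model gives 3 processors in total without a request and 21 with a request of 7, which lines up with the two `available_processors` values posted earlier.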

krzysztoftorunski · January 23, 2019, 2:25pm

So without a defined CPU request, Elasticsearch will use only one CPU, even if there are 8 available?

system · February 20, 2019, 2:32pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
