Elasticsearch 6.3 CPU load & Disk I/O increased

Hi,

I'm working on upgrading our Elasticsearch cluster from 1.4.3 to 6.3.0.
Both the old and the new cluster have 3 master nodes, 4 client (coordinating) nodes, and 11 data nodes (18 nodes in total).
The clusters run on AWS. Instance types haven't changed, except for the data nodes, which were actually upgraded (16 cores instead of 8), and we switched from EBS to ephemeral (instance store) disks to improve performance.

Both clusters hold the same data:
~1700 indices
~7000 shards
~550M docs
830 GB total

Both clusters handle up to 1000 requests per second, and the requests are identical (we are duplicating traffic to the new cluster).

We made the necessary changes to the queries so they work on ES6, nothing else.

Somehow, the new cluster performs much more disk I/O.

New cluster (iostat):
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 4.00 0.00 27.50 0 27
xvdb 3325.00 182836.00 48.00 182836 48
xvdc 3317.00 182888.00 40.00 182888 40
md127 25986.00 368484.00 88.00 368484 88

Old cluster (iostat):
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 0.00 0.00 0.00 0 0
xvdb 732.00 14980.00 1532.00 14980 1532

In addition, the new cluster's load average is much higher than the old one's, and I see the load spike every time disk I/O is high.
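For reference, this is roughly how I'm checking where the reads come from on the new data nodes (standard node stats / hot threads APIs; the localhost URL just stands in for one of the nodes):

```
# Per-node filesystem / I/O stats and index-level stats
curl -s 'http://localhost:9200/_nodes/stats/fs,indices?pretty'

# Busiest threads on each node, to see whether searches are driving the reads
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5'
```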

Any idea what can cause this behavior?

Thanks

Adding more information:

New cluster index settings:

"s_8770202_1536039001363": {
  "settings": {
    "index": {
      "routing": {
        "allocation": {
          "include": {
            "zone": "SERVER"
          }
        }
      },
      "search": {
        "slowlog": {
          "level": "warn",
          "threshold": {
            "fetch": {
              "warn": "500ms"
            },
            "query": {
              "warn": "500ms"
            }
          }
        }
      },
      "number_of_shards": "1",
      "provided_name": "s_8770202_1536039001363",
      "creation_date": "1536039040790",
      "analysis": {
        "normalizer": {
          "lowercase_normalizer": {
            "filter": ["lowercase"],
            "type": "custom"
          }
        },
        "analyzer": {
          "lowercase_analyzer": {
            "filter": ["lowercase"],
            "type": "custom",
            "tokenizer": "standard"
          }
        }
      },
      "number_of_replicas": "6",
      "uuid": "qPWQuhTDSmWwK2cCzRHgaw",
      "version": {
        "created": "6030099"
      }
    }
  }
}

Old cluster index settings (example):

"s_8770202_1527741946002": {
  "settings": {
    "index": {
      "creation_date": "1527742287597",
      "routing": {
        "allocation": {
          "include": {
            "zone": "SERVER"
          }
        }
      },
      "uuid": "2UiGcKk-StuN2YKrAtQvfg",
      "analysis": {
        "analyzer": {
          "lowercase_analyzer": {
            "filter": ["lowercase"],
            "tokenizer": "keyword"
          }
        }
      },
      "cache": {
        "query": {
          "enable": "true"
        }
      },
      "number_of_replicas": "6",
      "number_of_shards": "1",
      "latest_valid_version": "1527464715828",
      "version": {
        "created": "1040399"
      }
    }
  }
}

elasticsearch.yml comparison:

1.4
cluster.name: rcom-es-cluster
cluster.routing.allocation.awareness.attributes: availability_zone
node.name: data-srv-rcom-es-node-5fc6f11
node.master: false
node.data: true
node.zone: SERVER
node.availability_zone: us-east-1a
node.max_local_storage_nodes: 1
index.mapper.dynamic: true
action.auto_create_index: true
action.disable_delete_all_indices: true
path.conf: /etc/elasticsearch
path.data: /storage/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.mlockall: true
http.port: 9200
http.cors.enabled: true
http.cors.allow-origin: /.*/
http.cors.allow-credentials: true
gateway.recover_after_nodes: 1
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_concurrent_recoveries: 8
discovery.type: ec2
discovery.zen.minimum_master_nodes: 1
discovery.zen.ping.multicast.enabled: false
cloud.node.auto_attributes: true
cloud.aws.access_key:
cloud.aws.secret_key:
cloud.aws.region: us-east-1
discovery.ec2.host_type: private_ip
discovery.ec2.availability_zones: us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f
discovery.ec2.tag.rcom: true
action.destructive_requires_name: true
index.load_fixed_bitset_filters_eagerly: false
index.refresh_interval: 30s
index.search.slowlog.threshold.fetch.warn: 500ms
index.search.slowlog.threshold.query.warn: 500ms
indices.cache.filter.size: 10%
indices.fielddata.cache.size: 50%
indices.memory.index_buffer_size: 30%
indices.store.throttle.type: none
monitor.jvm.gc.ConcurrentMarkSweep.info: 5s
monitor.jvm.gc.ConcurrentMarkSweep.warn: 10s
monitor.jvm.gc.ParNew.info: 700ms
monitor.jvm.gc.ParNew.warn: 1s
threadpool.bulk.queue_size: 3000
threadpool.search.queue_size: 6000

6.3
cluster.name: rcom-es6-cluster
node.name: data-server-25bd51e
path.data: "/storage/elasticsearch"
path.logs: "/var/log/elasticsearch"
network.host: 0.0.0.0
discovery.zen.hosts_provider: ec2
discovery.zen.minimum_master_nodes: 1
discovery.ec2.tag.dy_cluster: rcom-es6-cluster
discovery.ec2.host_type: private_ip
action.auto_create_index: true
action.destructive_requires_name: true
cloud.node.auto_attributes: true
cluster.routing.allocation.node_concurrent_recoveries: 8
discovery.ec2.availability_zones: us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f
xpack.security.enabled: false
xpack.ml.enabled: false
node.ml: false
xpack.monitoring.collection.enabled: true
http.cors.allow-credentials: true
http.cors.allow-origin: "/.*/"
http.cors.enabled: true
http.port: 9200
indices.fielddata.cache.size: 50%
indices.memory.index_buffer_size: 30%
node.max_local_storage_nodes: 1
node.master: false
node.data: true
node.ingest: false
search.remote.connect: false
node.attr.zone: SERVER
cluster.routing.allocation.awareness.attributes: zone
node.attr.availability_zone: us-east-1
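
One difference I noticed while putting this comparison together: the 1.4 config sets several index-level settings directly in elasticsearch.yml (index.refresh_interval, the search slowlog thresholds, index.load_fixed_bitset_filters_eagerly), which 6.x no longer allows at the node level, so on the new cluster they have to be applied per index or through an index template. A minimal sketch of how that could look (the template name and pattern are just examples):

```
curl -s -X PUT 'http://localhost:9200/_template/default_index_settings' \
  -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["*"],
  "order": 0,
  "settings": {
    "index.refresh_interval": "30s",
    "index.search.slowlog.threshold.query.warn": "500ms",
    "index.search.slowlog.threshold.fetch.warn": "500ms"
  }
}'
```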

I'm not an expert at scaling out Elasticsearch nodes, but could the fact that you have number_of_replicas set to 0 in the new cluster be a problem? That would prevent the even distribution of queries across replicas.

Thanks @jaddison, that was a bad example and it's not the usual case.
Most of my indices have a replication factor of 6 (I have 8 "serving" data nodes).
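To double-check, I'm listing the primary and replica counts per index with the cat indices API, roughly like this (again, localhost stands in for one of the nodes):

```
curl -s 'http://localhost:9200/_cat/indices?h=index,pri,rep' | sort
```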
