Elasticsearch 6.3 CPU load & Disk I/O increased

Hi,

I'm working on upgrading our Elasticsearch cluster from 1.4.3 to 6.3.0.
Both the old and the new cluster have 3 master nodes, 4 client (coordinating) nodes, and 11 data nodes (18 nodes in total).
The clusters run on AWS. Instance types haven't changed, except for the data nodes, which were actually upgraded (16 cores instead of 8), and we switched from EBS to ephemeral (instance store) disks to improve performance.

Both clusters hold the same data:
~1700 indices
~7000 shards
~550M docs
830 GB total

Both clusters handle up to 1000 requests per second, and the requests are identical (we are duplicating traffic to the new cluster).

We made the necessary changes to the queries so they work on ES6, nothing else.

Somehow, the new cluster performs much more disk I/O.

New cluster (iostat):
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 4.00 0.00 27.50 0 27
xvdb 3325.00 182836.00 48.00 182836 48
xvdc 3317.00 182888.00 40.00 182888 40
md127 25986.00 368484.00 88.00 368484 88

Old cluster (iostat):
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 0.00 0.00 0.00 0 0
xvdb 732.00 14980.00 1532.00 14980 1532

In addition, the new cluster's load average is much higher than the old one's, and I see the load spike every time disk I/O is high.
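For reference, this is roughly how I'm checking where the reads come from on the new data nodes (standard node stats / hot threads APIs; the localhost URL just stands in for one of the nodes):

```
# Per-node filesystem / I/O stats and index-level stats
curl -s 'http://localhost:9200/_nodes/stats/fs,indices?pretty'

# Busiest threads on each node, to see whether searches are driving the reads
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5'
```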

Any idea what can cause this behavior?

Thanks

Adding more information:

New cluster index settings:

"s_8770202_1536039001363": {
  "settings": {
    "index": {
      "routing": {
        "allocation": {
          "include": {
            "zone": "SERVER"
          }
        }
      },
      "search": {
        "slowlog": {
          "level": "warn",
          "threshold": {
            "fetch": {
              "warn": "500ms"
            },
            "query": {
              "warn": "500ms"
            }
          }
        }
      },
      "number_of_shards": "1",
      "provided_name": "s_8770202_1536039001363",
      "creation_date": "1536039040790",
      "analysis": {
        "normalizer": {
          "lowercase_normalizer": {
            "filter": ["lowercase"],
            "type": "custom"
          }
        },
        "analyzer": {
          "lowercase_analyzer": {
            "filter": ["lowercase"],
            "type": "custom",
            "tokenizer": "standard"
          }
        }
      },
      "number_of_replicas": "6",
      "uuid": "qPWQuhTDSmWwK2cCzRHgaw",
      "version": {
        "created": "6030099"
      }
    }
  }
}

Old cluster index settings (example):

"s_8770202_1527741946002": {
  "settings": {
    "index": {
      "creation_date": "1527742287597",
      "routing": {
        "allocation": {
          "include": {
            "zone": "SERVER"
          }
        }
      },
      "uuid": "2UiGcKk-StuN2YKrAtQvfg",
      "analysis": {
        "analyzer": {
          "lowercase_analyzer": {
            "filter": ["lowercase"],
            "tokenizer": "keyword"
          }
        }
      },
      "cache": {
        "query": {
          "enable": "true"
        }
      },
      "number_of_replicas": "6",
      "number_of_shards": "1",
      "latest_valid_version": "1527464715828",
      "version": {
        "created": "1040399"
      }
    }
  }
}

elasticsearch.yml comparison:

1.4
cluster.name: rcom-es-cluster
cluster.routing.allocation.awareness.attributes: availability_zone
node.name: data-srv-rcom-es-node-5fc6f11
node.master: false
node.data: true
node.zone: SERVER
node.availability_zone: us-east-1a
node.max_local_storage_nodes: 1
index.mapper.dynamic: true
action.auto_create_index: true
action.disable_delete_all_indices: true
path.conf: /etc/elasticsearch
path.data: /storage/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.mlockall: true
http.port: 9200
http.cors.enabled: true
http.cors.allow-origin: /.*/
http.cors.allow-credentials: true
gateway.recover_after_nodes: 1
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_concurrent_recoveries: 8
discovery.type: ec2
discovery.zen.minimum_master_nodes: 1
discovery.zen.ping.multicast.enabled: false
cloud.node.auto_attributes: true
cloud.aws.access_key:
cloud.aws.secret_key:
cloud.aws.region: us-east-1
discovery.ec2.host_type: private_ip
discovery.ec2.availability_zones: us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f
discovery.ec2.tag.rcom: true
action.destructive_requires_name: true
index.load_fixed_bitset_filters_eagerly: false
index.refresh_interval: 30s
index.search.slowlog.threshold.fetch.warn: 500ms
index.search.slowlog.threshold.query.warn: 500ms
indices.cache.filter.size: 10%
indices.fielddata.cache.size: 50%
indices.memory.index_buffer_size: 30%
indices.store.throttle.type: none
monitor.jvm.gc.ConcurrentMarkSweep.info: 5s
monitor.jvm.gc.ConcurrentMarkSweep.warn: 10s
monitor.jvm.gc.ParNew.info: 700ms
monitor.jvm.gc.ParNew.warn: 1s
threadpool.bulk.queue_size: 3000
threadpool.search.queue_size: 6000

6.3
cluster.name: rcom-es6-cluster
node.name: data-server-25bd51e
path.data: "/storage/elasticsearch"
path.logs: "/var/log/elasticsearch"
network.host: 0.0.0.0
discovery.zen.hosts_provider: ec2
discovery.zen.minimum_master_nodes: 1
discovery.ec2.tag.dy_cluster: rcom-es6-cluster
discovery.ec2.host_type: private_ip
action.auto_create_index: true
action.destructive_requires_name: true
cloud.node.auto_attributes: true
cluster.routing.allocation.node_concurrent_recoveries: 8
discovery.ec2.availability_zones: us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f
xpack.security.enabled: false
xpack.ml.enabled: false
node.ml: false
xpack.monitoring.collection.enabled: true
http.cors.allow-credentials: true
http.cors.allow-origin: "/.*/"
http.cors.enabled: true
http.port: 9200
indices.fielddata.cache.size: 50%
indices.memory.index_buffer_size: 30%
node.max_local_storage_nodes: 1
node.master: false
node.data: true
node.ingest: false
search.remote.connect: false
node.attr.zone: SERVER
cluster.routing.allocation.awareness.attributes: zone
node.attr.availability_zone: us-east-1
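
One difference I noticed while putting this comparison together: the 1.4 config sets several index-level settings directly in elasticsearch.yml (index.refresh_interval, the search slowlog thresholds, index.load_fixed_bitset_filters_eagerly), which 6.x no longer allows at the node level, so on the new cluster they have to be applied per index or through an index template. A minimal sketch of how that could look (the template name and pattern are just examples):

```
curl -s -X PUT 'http://localhost:9200/_template/default_index_settings' \
  -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["*"],
  "order": 0,
  "settings": {
    "index.refresh_interval": "30s",
    "index.search.slowlog.threshold.query.warn": "500ms",
    "index.search.slowlog.threshold.fetch.warn": "500ms"
  }
}'
```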

I'm not an expert at scaling out Elasticsearch nodes, but could the fact that you have number_of_replicas set to 0 in the new cluster be a problem? That would prevent the even distribution of queries across replicas.

Thanks @jaddison, that was a bad example and it's not the usual case.
Most of my indices have a replication factor of 6 (I have 8 "serving" data nodes).
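To double-check, I'm listing the primary and replica counts per index with the cat indices API, roughly like this (again, localhost stands in for one of the nodes):

```
curl -s 'http://localhost:9200/_cat/indices?h=index,pri,rep' | sort
```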
