We're trying to figure out why we seem to have hit an ingestion limit in our ES cluster.
Use case: Logging
Version: 5.1.1
Master nodes: 3
HTTP nodes: 3
Data nodes: 50
The servers are OpenStack VMs running on bare metal. We don't over-provision the bare-metal hosts, and there is only one ES data node per host.
Server VM Specs
vCPUs: 10
RAM: 64 GB
Disk: 2 TB spinning disk
ES Config
bootstrap.memory_lock: true
cluster.name: elasticsearch_1
discovery.zen.ping_timeout: 60s
discovery.zen.ping.unicast.hosts: 10.173.188.214,10.173.188.215,10.173.188.216
http.cors.allow-origin: /.*/
http.cors.enabled: true
http.port: 9200
indices.fielddata.cache.size: 25%
indices.memory.index_buffer_size: 20%
network.host: _eth0_
node.attr.box_type: hot
node.data: true
node.master: false
node.name: 10.173.131.213-hot
path.conf: /etc/elasticsearch/hot
path.data: /opt/elasticsearch/data
path.logs: /var/log/elasticsearch/10.173.131.213-hot
thread_pool.bulk.queue_size: 2000
thread_pool.search.size: 10
transport.tcp.port: 9300
Note that even though the box_type is "hot", this is not a hot/warm configuration.
jvm.options
-Dfile.encoding=UTF-8
-Dio.netty.noKeySetOptimization=true
-Dio.netty.noUnsafe=true
-Djava.awt.headless=true
-Djna.nosys=true
-Dlog4j2.disable.jmx=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j.skipJansi=true
-server
-XX:+AlwaysPreTouch
-XX:CMSInitiatingOccupancyFraction=75
-XX:+DisableExplicitGC
-XX:+HeapDumpOnOutOfMemoryError
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
ES_JAVA_OPTS="-Xms31g -Xmx31g"
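For what it's worth, we double-check that the 31 GB heap is actually picked up on every node with something like the following, run against one of the HTTP nodes (column names are from the 5.x _cat/nodes docs):

curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent,ram.percent'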
Output of localhost:9200/
{
  "name": "10.173.188.218-http",
  "cluster_name": "elasticsearch_1",
  "cluster_uuid": "_LQl2KtyQAyxL-D69YnmOw",
  "version": {
    "number": "5.1.1",
    "build_hash": "5395e21",
    "build_date": "2016-12-06T12:36:15.409Z",
    "build_snapshot": false,
    "lucene_version": "6.3.0"
  },
  "tagline": "You Know, for Search"
}
Output of localhost:9200/_cluster/health
{
  "cluster_name": "elasticsearch_1",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 56,
  "number_of_data_nodes": 50,
  "active_primary_shards": 3806,
  "active_shards": 7612,
  "relocating_shards": 10,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}
Output of localhost:9200/_cluster/settings
{
  "persistent": {
    "action": {
      "search": {
        "shard_count": {
          "limit": "5000"
        }
      }
    },
    "cluster": {
      "routing": {
        "rebalance": {
          "enable": "none"
        },
        "allocation": {
          "node_concurrent_recoveries": "2",
          "disk": {
            "threshold_enabled": "true",
            "watermark": {
              "low": "85%",
              "high": "95%"
            }
          },
          "node_initial_primaries_recoveries": "8",
          "enable": "all"
        }
      },
      "info": {
        "update": {
          "interval": "60s"
        }
      }
    },
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "500mb"
      }
    }
  },
  "transient": {
    "cluster": {
      "routing": {
        "rebalance": {
          "enable": "all"
        },
        "allocation": {
          "cluster_concurrent_rebalance": "10",
          "node_concurrent_recoveries": "6"
        }
      }
    }
  }
}
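For completeness, the transient values above were set with a standard cluster settings update, roughly like this (values as shown in the dump):

curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": "6",
    "cluster.routing.allocation.cluster_concurrent_rebalance": "10"
  }
}'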
We create a new index every two hours (12 indices per day, numbered 01-12 rather than the hourly 00-23) and keep 7 days' worth of indices. Each index has 45 primary shards with 1 replica per primary.
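That layout accounts for essentially all of the shards in the _cluster/health output above:

12 indices/day x 7 days   = 84 indices retained
84 indices x 45 primaries = 3,780 primary shards
3,780 x 2 copies          = 7,560 shards

which lines up with the 3,806 active primaries / 7,612 active shards reported earlier, give or take a few other indices in the cluster.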
Our sustained ingestion rate has been around 20K logs per second. That was fine until this week, when the team that uses this cluster started sending around 40K logs per second, and now the cluster can't keep up with the incoming logs.
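If it helps with the analysis, we can capture bulk thread pool stats from the data nodes while the backlog is building, with something like:

curl -s 'localhost:9200/_cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected,completed'

and we can also grab _nodes/hot_threads output during the slowdown if that would be useful.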
We're considering scaling out the cluster by adding more data nodes, but we're not sure that would help. We'd like to look at configuration changes before spending more money on hardware.
Any suggestions on what changes we can make to improve performance would be appreciated. If there's additional information that would help analyze the situation, please let me know.
Thanks,
Ray