Slowness in ES bulk inserts

We are running behind schedule and need help. Please take a look and suggest what we can improve.

Issue: slowness in bulk indexing.
For 10,000 records it takes 27,560 ms.
For 500 records it takes about 4,448 ms.

  • We want to index about 100,000 documents, and that takes 372,023 ms, which is far too long.

  • We tried different bulk sizes (500, 20,000, 100,000) but could not achieve the desired result.

  • We tried JestClient, RestHighLevelClient, and BulkProcessor, but nothing has helped (a sketch of how we drive BulkProcessor is below).
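
For reference, this is roughly how we drive BulkProcessor. It is a simplified sketch: the host, index name, document source, and tuning values (batch size, concurrency, back-off) are placeholders, and it assumes the 7.x Java high-level REST client (on 6.x the IndexRequest also needs a document type).

import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentType;

import java.util.concurrent.TimeUnit;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("hydoperfes0", 9200, "http")));

        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) { }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                // Log how long each bulk took on the server side.
                System.out.printf("bulk %d: %d docs in %d ms%n",
                        executionId, request.numberOfActions(), response.getTook().getMillis());
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                failure.printStackTrace();
            }
        };

        BulkProcessor bulkProcessor = BulkProcessor.builder(
                (request, bulkListener) -> client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                listener)
            .setBulkActions(500)                                // flush every 500 documents
            .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)) // or every 5 MB, whichever comes first
            .setConcurrentRequests(2)                           // allow 2 bulk requests in flight
            .setBackoffPolicy(BackoffPolicy.exponentialBackoff(TimeValue.timeValueMillis(100), 3))
            .build();

        // Dummy documents standing in for our real payload.
        for (int i = 0; i < 100_000; i++) {
            bulkProcessor.add(new IndexRequest("myindex")
                    .source("{\"field\":\"value " + i + "\"}", XContentType.JSON));
        }

        bulkProcessor.awaitClose(1, TimeUnit.MINUTES);
        client.close();
    }
}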

System Configurations:

  • OS: Linux (Debian 9.12)
  • Standard DS2 v2 (2 vCPUs, 7 GiB memory)
  • 3 nodes configured

jvm.options:
-Xms5g
-Xmx5g
-XX:+UseG1GC
-XX:+UseStringDeduplication
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

elasticsearch.yml

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: hydroperformance
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: hydoperfes0
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /opt/bitnami/elasticsearch/data
#
# Path to log files:
#
#path.logs: /path/to/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 0.0.0.0
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.unicast.hosts: ["hydoperfes0","hydoperfes1","hydoperfes2"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 2
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true

transport.tcp.port: 9300
network.publish_host: 10.0.0.6
discovery.initial_state_timeout: 5m
gateway.expected_nodes: 3
indices.memory.index_buffer_size: 30%


Thanks in advance.

You should not set the heap (-Xms/-Xmx) higher than 50% of available RAM, which in this case is about 3.5 GB.
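
For example (values here are only illustrative), something like this in jvm.options stays within that limit on a 7 GiB machine:

-Xms3g
-Xmx3g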

I strongly recommend using the default settings and not overriding these expert-level settings.

Have you identified what is limiting throughput? Assuming you are not seeing long or frequent GC, it is often either CPU usage or disk I/O, so it is important to use fast storage. See this section in the docs for more information and tips. Also make sure you are indexing into enough shards to get all the nodes involved. If you index into a single index with the default of 1 primary shard, only one or two nodes will be doing work.
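
As a rough illustration (the index name and shard count here are just examples, and client is assumed to be an existing RestHighLevelClient as in your snippet above), you could create the index with one primary shard per data node:

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;

// One primary shard per data node (3 here) so all nodes share the indexing work.
CreateIndexRequest createRequest = new CreateIndexRequest("myindex")
        .settings(Settings.builder()
                .put("index.number_of_shards", 3)
                .put("index.number_of_replicas", 1));
client.indices().create(createRequest, RequestOptions.DEFAULT);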

The docs linked by @Christian_Dahlqvist briefly mention "unsetting" refreshes but don't expand on it. You can disable refresh entirely by setting refresh_interval=-1 in an index's settings if you need to perform a one-time, large-scale ingestion. Documents will be largely unsearchable until refresh is re-enabled, but it can speed things up for initial data loads.
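
For example, with the high-level REST client (the index name is a placeholder and client is assumed to be an existing RestHighLevelClient), something along these lines:

import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.common.settings.Settings;

// Disable refresh before the big ingestion...
client.indices().putSettings(
        new UpdateSettingsRequest("myindex")
                .settings(Settings.builder().put("index.refresh_interval", "-1")),
        RequestOptions.DEFAULT);

// ...and restore it afterwards so documents become searchable again.
client.indices().putSettings(
        new UpdateSettingsRequest("myindex")
                .settings(Settings.builder().put("index.refresh_interval", "1s")),
        RequestOptions.DEFAULT);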

You are on Azure, correct? Is your storage using SSDs? Have you tested how many IOPS it is getting?

I had a problem a couple of years ago on Azure where, even with Premium SSD-based storage, the IOPS were quite low. It was only solved after opening a support ticket to check the hardware behind it, which was failing.

I tried disabling it via the index settings API; that didn't help in our case.

Yes, it's on Azure. I will check the SSD and IOPS situation and post my findings.

I will try setting the heap to 50% of RAM and reverting the other configurations to see how it behaves.

@Christian_Dahlqvist, we tried the following as you suggested:

We made this change but did not see any improvement; indexing about 500 documents performed the same, taking an average of 3 seconds.

We reverted this as well; still no gain.

We have 5 shards configured. Also, we followed the link and did the following:

  • Changed the index settings, setting refresh_interval to 90s, and indexed again, but saw no performance gain.

  • Used multiple worker threads to index (roughly as sketched below), but each individual thread was still taking an average of 3 seconds per 500 documents.

  • We have just 1 replica.
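
The worker threads looked roughly like this (a simplified sketch: index name, document source, thread count, and batch size are placeholders, and client is the same RestHighLevelClient as before):

import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WorkerThreadIndexer {

    // Each submitted task sends one synchronous bulk request of batchSize documents.
    static void indexWithWorkers(RestHighLevelClient client) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // worker thread count (example)
        int totalDocs = 100_000;
        int batchSize = 500;
        for (int start = 0; start < totalDocs; start += batchSize) {
            final int from = start;
            pool.submit(() -> {
                BulkRequest bulk = new BulkRequest();
                for (int i = from; i < from + batchSize; i++) {
                    bulk.add(new IndexRequest("myindex")
                            .source("{\"field\":\"value " + i + "\"}", XContentType.JSON));
                }
                try {
                    client.bulk(bulk, RequestOptions.DEFAULT); // each thread averaged ~3 s per 500 docs
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}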

Just for your information, we have unmanaged Standard HDD storage. Do you think that might be slowing us down?

We are using unmanaged Standard HDDs. Do you think we can get millisecond-level response times for indexing 500 documents on those?

Storage performance is often the limiting factor. I would recommend you try with premium storage and see what impact that has.

Probably not; unmanaged Standard HDDs are pretty slow for Elasticsearch.

You should try the Premium tier; even the unmanaged Premium tier is faster than HDD, though it is also more expensive.
