Multiple child instances of a single client or multiple clients: which is better for bulk indexing at high rates?

Hi, I'm using Elasticsearch v7.5.0 and have a huge number of documents being ingested per second. As per the documentation, it is recommended to use multiple clients for bulk indexing to reduce load. Can I get the same results if I use multiple child instances of a single client?

child clients

How are you currently indexing data into Elasticsearch? As you have linked to the JavaScript client I assume you are using a custom JavaScript application. Is this correct?

What type of data are you indexing? What is the average size of a document? How much data do you have to index?

What is the size and specification of your cluster?

Yes, I'm using the Node.js client. I'm working on a personal project with a maximum of 5000 JSON documents (mostly firewall logs) being ingested per second into an 8-node cluster running in a Docker overlay network (actually 4 virtual machines connected via Docker overlay), with 3 master nodes, 4 data nodes and one coordinating node. I'm currently using 8 ES clients to bulk ingest data. The 5k figure is the combined rate per second across these 8 clients. A single bulk request may contain data destined for multiple indices, each having 4 shards and one replica.
Due to high CPU and network load, my cluster is very unstable (data nodes are being disconnected frequently). I'm just trying to find the maximum load my cluster can handle.

It sounds like you may have hit the limits of what your Elasticsearch cluster can handle. If the cluster is struggling, I see no benefit in increasing load or concurrency from the client side.

How many indices and shards are you concurrently indexing into? Are you using time-based indices?

What is the specification of the VMs and how much resources are assigned to the different nodes?

Which nodes are you sending bulk requests to?

What type of storage are you using for Elasticsearch?

At any given time my JavaScript clients are ingesting data into almost 15 indices. I'm not using the built-in ILM or time-based indices.

My VM configurations and the nodes running on them are:

PC1 (6 cores / 20 GB RAM / 1 TB HDD)

  • master-a (heap 2 GB)
  • data-1 (heap 16 GB)

PC2 (6 cores / 20 GB RAM / 1 TB HDD)

  • master-b (heap 2 GB)
  • data-2 (heap 16 GB)

PC3 (4 cores / 15 GB RAM / 1 TB HDD)

  • master-c (heap 2 GB)
  • data-3 (heap 10 GB)

PC4 (4 cores / 15 GB RAM / 1 TB HDD)

  • coordinate (heap 4 GB)
  • data-4 (heap 10 GB)

I'm ingesting data into data-1 and data-2 only.
data-3 and data-4 are not in use as of now (I'm planning to keep a different set of data on them in future, mostly non-log data). I restricted index allocation to the first two data nodes using "index.routing.allocation.require.box_type".

If bulk requests can target all 15 indices with up to 60 primary shards, you are going to end up with a lot of small writes, which can be inefficient, especially if you are using slow HDD storage.
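To make the fan-out concrete, here is a rough sketch of the arithmetic using the figures from this thread (15 indices, 4 primaries each, one replica, two data nodes in use). The bulk size of 600 documents is a hypothetical value chosen for illustration, and the even spread across shards is an assumption that real routing only approximates.

```javascript
// Sketch: how one mixed bulk request fans out across shards.
// Assumes documents are spread evenly across all indices and shards.
function bulkFanOut({ indices, primariesPerIndex, replicasPerPrimary, dataNodes, docsPerBulk }) {
  const primaryShards = indices * primariesPerIndex;
  // Every primary write is also applied to each of its replicas.
  const shardCopies = primaryShards * (1 + replicasPerPrimary);
  return {
    primaryShards,
    shardCopies,
    shardCopiesPerDataNode: shardCopies / dataNodes,
    docsPerPrimaryPerBulk: docsPerBulk / primaryShards,
  };
}

const fanOut = bulkFanOut({
  indices: 15,
  primariesPerIndex: 4,
  replicasPerPrimary: 1,
  dataNodes: 2,
  docsPerBulk: 600, // hypothetical bulk size, not stated in the thread
});

console.log(fanOut);
```

Under these assumptions each bulk request is split into 60 shard-level writes of about 10 documents each, and each of the two data nodes hosts 60 shard copies.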

Given that you are using HDDs, which are not ideal for high indexing loads, I would recommend you check iowait and disk utilisation to see whether this is a bottleneck, e.g. using iostat -x.

I would recommend upgrading to SSDs and ensuring that each bulk request targets as few different indices as possible.
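One way to keep each bulk request focused on a single index is to group documents by target index on the client before sending. The following is a minimal sketch in plain Node.js. The input shape (`{ index, doc }`), the helper name `buildPerIndexBulkBodies` and the index names are all made up for illustration. It only builds the request bodies and does not talk to a cluster.

```javascript
// Sketch: split a mixed batch into one bulk body per target index.
// Each input item is assumed to look like { index: 'some-index', doc: {...} }.
function buildPerIndexBulkBodies(docs) {
  const bodies = new Map();
  for (const { index, doc } of docs) {
    if (!bodies.has(index)) bodies.set(index, []);
    // Bulk API body format: an action line followed by the document source.
    bodies.get(index).push({ index: { _index: index } }, doc);
  }
  return bodies;
}

const bodies = buildPerIndexBulkBodies([
  { index: 'firewall-logs', doc: { action: 'deny', port: 22 } },
  { index: 'proxy-logs', doc: { status: 200 } },
  { index: 'firewall-logs', doc: { action: 'allow', port: 443 } },
]);

for (const [index, body] of bodies) {
  // With the Node.js client, each body would be sent as its own bulk request.
  console.log(index, body.length / 2, 'documents');
}
```

In a real application each per-index body would probably be buffered until it reaches a reasonable size before being sent, so that the requests are not only single-index but also large enough to be efficient.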

When running Elasticsearch, the recommendation is to not set the heap above 50% of the RAM available to the node. Your settings seem a lot higher than this, which is not good. If the master node on PC1 is allocated 4GB of RAM, it should have the heap set to 2GB, while the data node should have a heap of 8GB as it has 16GB RAM allocated to it.

The same applies to the other nodes.
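As a quick sanity check, the 50% guideline can be applied to the numbers posted above. This sketch sums the configured heaps per host and compares the total with half of the host's RAM. Treating the guideline at whole-host level is a simplifying assumption here, since the exact RAM allocated to each container was not stated.

```javascript
// Sketch: compare total configured heap per host with 50% of host RAM.
// Host figures are taken from the configuration posted in this thread.
const hosts = [
  { name: 'PC1', ramGb: 20, heapsGb: [2, 16] },
  { name: 'PC2', ramGb: 20, heapsGb: [2, 16] },
  { name: 'PC3', ramGb: 15, heapsGb: [2, 10] },
  { name: 'PC4', ramGb: 15, heapsGb: [4, 10] },
];

function heapReport({ name, ramGb, heapsGb }) {
  const totalHeapGb = heapsGb.reduce((sum, h) => sum + h, 0);
  const maxRecommendedGb = ramGb / 2; // the 50% guideline
  return { name, totalHeapGb, maxRecommendedGb, overGuideline: totalHeapGb > maxRecommendedGb };
}

const report = hosts.map(heapReport);
console.table(report);
```

With the posted settings every host ends up over the guideline, for example 18 GB of heap on PC1 against a suggested ceiling of 10 GB, which leaves very little memory for the operating system page cache.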

Welcome to our community! :smiley:

Please note that this version is EOL and no longer supported; you should be looking to upgrade as a matter of urgency.
