How to Insert 50Million documents per 30sec in elasticsearch cluster?

Hi all,
Iam facing difficulty to insert 50million documents per 30 sec from source to Elasticsearch cluster. I have 7 sources so total = 350million documents per 30 sec. I have one machine with 500GB Ram, 500Tb storage and 160 cores. Its an on-prem solution so i can't scale horizontaly for now. I will be getting data into kafka broker.

Iam using cluster mode in nodejs to spawn multiple threads to read from kafka broker and insert into elasticserch parellely. I tried creating different scenerios with varying no. of nodes, shards, batch size for bulk insert and no. of threads used to insert into elasticsearch but Iam not able to insert the data in the required time. I want to insert data in near real time. Little delay in insertion is ok but should not go for days to insert the old data. Write is priority for me not read. I have done basic optimizations for bulk insert mentioned in elasticsearch documentation. Iam having hourly index.

Can someone suggest how to approach this, is it possible to achieve this in this hardware? Should i use multiple clusters and insert into cluster using seperate threads? What should be no. of shards, nodes, batch size for bulk insert and threads from nodejs process?

Please suggest

50 million documents per 30 seconds is around 1.7 million documents per second, which is quite a lot.

What is the average size of your documents?

What is the retention period of your data?

I have never tried to scale a cluster on a single node, but if we look at the CPU and RAM resouces available it corresponds to 10 nodes with 16 CPU cores and 50GB RAM. As far as I know, most people with large hosts tend to run multiple Elasticsearch instances as Elasticsearch does not necessarily scale vertically to that limit.

If we assume 10 nodes with the above specification, each of them would need to handle 170.000 documents per second. This is more than I have seen achieved so I am not sure you will be able to achieve your goals with the current hardware. Whenever I have heard of use cases with this volume of writes, they have as far as I can recall been achieved on significantly larger clusters.

Also note that indexing in Elasticsearch is very I/O intensive, and it is often the storage that is the bottleneck rather than memory or CPU. Fo optimal indexing speed Elasticsearch required storage with throughput and latency in line with what fast local SSDs can provide. Given that you specify 500TB of storage it sounds like you have some kind of networked storage, which often performs significantly below what local SSDs do. I would expect this to likely be your bottleneck.

1 Like

What Christian says is good advice, but just one more thing:

I expect that's not necessary, you can achieve this sort of indexing rate in a single cluster. However like Christian I've also never seen it done with a single massive host. I suppose it might be possible with some careful tuning and isolation between ES nodes but I'd expect you to encounter bottlenecks and other contention problems that would be easily avoided with more/smaller hosts.

As mentioned, I'm not sure you can achieve this event rate with a single node, but maybe you can improve it.

Can you provide a little more information about how are you using Elasticsearch?

I have one machine with 500GB Ram, 500Tb storage and 160 cores.

But how are you running elasticsearch? Which version are you using? Is it installed direct on this bare metal or are you using virtualization? Also, what is the disk type? What is the filesystem of the data path of Elasticsearch? Depending on the filesystem and how it is mounted you can improve the speed.

Iam using cluster mode in nodejs to spawn multiple threads to read from kafka broker and insert into elasticserch parellely.

What is the bulk size you are using? Any reason to build a node.js application instead of an existing tool like logstash?

Also, have you read the tune for indexing speed documentation already?

Hi all thanks for quick response.
For answering the questions:

  1. My document size is 72 bytes
  2. Iam using docker to start elasticsearch on Ubuntu Os
  3. Retention period is 2 years
  4. Version 8.6.2
  5. Lvm created for multiple SSD for storage.
  6. Filesystem I'm not sure but default Nfs only I think

Actually my data is in binary format , so I need to make sense of some data convert it into json and then do bulk insert into elastic.
If I use logstash I'm not sure those calculations would be easier to do or not. I have to check on performance from custom script or logstash.

Batch size I have experimented with 10k 5k even 2million but I'm not able to find a reasonable value so that my Ram consumption is also not too high.
Can you suggest me some pointers so that I can try those setups and check if any luck. Eg. How many no. Of nodes or seperate clusters, batch size and no. Of concurrent threads to use.

If I had to guess, that's where I'd start looking for bottlenecks.


What is the average size of the JSON documents once the binary data has been transformed into JSON form?

I would try setting up a cluster of 10 nodes where each has 50GB RAM, 25GB heap and 16 CPU cores.

As a starting point for testing indexing performance I would recommend setting up a data stream with 10 or 20 primary shards and no replicas (1 or 2 per node should be a good starting point). Configure an ILM policy to roll the index over at 50GB shard size (index size will be 10 or 20 times that depending on the number of primary shards). Set the maximum time in the how zone to a few days days (it will roll over a lot more frequently than that due to the size limit). Make sure that you have good monitoring in place so you have some data to work with when identifying bottlenecks.

Then start indexing data into the cluster. I assume you will need a quite large number of concurrent indexing threads and these should ideally be distributed across all data nodes. Select a bulk size that results in bulk requests of a couple of MB in size.

Once you start indexing, monitor network usage (could potentially be the bottleneck), CPU, RAM, heap as well as disk I/O (latency, throughput and IOPS). I also suspect storage is likely to be the first bottleneck hit based on the information you have provided.

It is also worth looking at the numbers provided. If we assume the average size of the JSON documents is 150 bytes (might be low as JSON is a lot more verbose than binary data) this corresponds to 14Gb per minute or around 20TB per day if I have calculated correctly. Over two years this is around 14.6 PB. The ratio between raw JSON size and the size it takes up on disk when indexed will vary depending on the mappings and index settings used. Expacting it to shrink to around 4% of the original size is however IMHO not realistic, so I suspect your storage is undersized as well.

1 Like