Debugging extremely slow indexing

I have recently "inherited" an ES cluster at work that used to run just fine (at least since I inherited it), but it has recently been experiencing extreme performance issues around indexing. We have 20 "processor" containers in our pipeline that each index data into the cluster. Previously each of these containers could index at a rate of at least 50 docs per second (all single requests, no bulk API). Originally the cluster was sized at 8 nodes.

Over time we saw performance degrade dramatically (<= 2 docs per second). I realized the cluster was managing ~24,000 shards! The original developers made some poor choices regarding indexing strategy (many of the shards are also small, on the order of MBs in size). Reindexing was painfully slow at the time, so my thinking was to scale out to get under the recommendation of 20 shards per GB of heap (each data node has ~30GB heap) and gain some head-room to reindex. I scaled out to 48 nodes total: ~24,000 / 48 = ~500 shards per node. It made very little difference to my indexing speed (<= 2-3 docs per second). Reindexing was better but still felt slow.

The logs for the nodes show a bunch of transport exceptions (not sure what these are about) and frequent GC. I find the GC situation odd because the heap is sized according to the recommendations (half of RAM and not more than 30GB). Below is a log snippet from one node that is fairly "active"; I notice other nodes just have a ton of GC messages in their logs (they seem idle?). Here is also a link to my current node stats. I am at a total loss as to what to do or try next, or even where to look. Any help would be greatly appreciated.

[2020-12-12T19:58:37,845][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][86198] overhead, spent [339ms] collecting in the last [1s]
[2020-12-12T20:02:52,132][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-elasticsearch-data-1b-4] added {{elasticsearch-elasticsearch-data-1a-1}{b0RKcobaS7SeTIYPv2xZZQ}{bfXYE8QMTDC6CRwidyEWdg}{100.103.95.130}{100.103.95.130:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {elasticsearch-elasticsearch-master-2}{9Us1bbUnQWuak-hZ_Cq19w}{Zldki3ouSmGNcSpf0zUU_Q}{100.111.227.67}{100.111.227.67:9300}{xpack.installed=true} committed version [5198]])
[2020-12-12T20:38:30,253][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][88585] overhead, spent [312ms] collecting in the last [1s]
[2020-12-12T20:38:31,677][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-elasticsearch-data-1b-4] removed {{elasticsearch-elasticsearch-data-1b-13}{_GWTqqXVRnmKr9r5fE2-zA}{I4Zdj1yfThGV7MeSh0_ahg}{100.118.195.67}{100.118.195.67:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {elasticsearch-elasticsearch-master-2}{9Us1bbUnQWuak-hZ_Cq19w}{Zldki3ouSmGNcSpf0zUU_Q}{100.111.227.67}{100.111.227.67:9300}{xpack.installed=true} committed version [5747]])
[2020-12-12T20:38:31,720][INFO ][o.e.i.s.IndexShard       ] [elasticsearch-elasticsearch-data-1b-4] [78827-ca-cosumnes_community_services_district_fire_department-apparatus-fire-incident-2018-10-es6apparatusfireincident][1] primary-replica resync completed with 0 operations
[2020-12-12T20:39:35,412][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][88650] overhead, spent [319ms] collecting in the last [1s]
[2020-12-12T20:39:41,508][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][88656] overhead, spent [349ms] collecting in the last [1s]
[2020-12-12T20:47:06,548][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-elasticsearch-data-1b-4] added {{elasticsearch-elasticsearch-data-1b-13}{_GWTqqXVRnmKr9r5fE2-zA}{1HjoV0UoS0Wr0x1EJM8L9w}{100.106.163.66}{100.106.163.66:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {elasticsearch-elasticsearch-master-2}{9Us1bbUnQWuak-hZ_Cq19w}{Zldki3ouSmGNcSpf0zUU_Q}{100.111.227.67}{100.111.227.67:9300}{xpack.installed=true} committed version [5766]])
[2020-12-12T21:42:32,764][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][92422] overhead, spent [313ms] collecting in the last [1.1s]
[2020-12-12T21:42:34,495][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-elasticsearch-data-1b-4] removed {{elasticsearch-elasticsearch-data-1b-12}{rZx5hg0RQ36XIc1K0PhYDg}{IpRLByWFTSyU2hAWkPwwgg}{100.105.197.132}{100.105.197.132:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {elasticsearch-elasticsearch-master-2}{9Us1bbUnQWuak-hZ_Cq19w}{Zldki3ouSmGNcSpf0zUU_Q}{100.111.227.67}{100.111.227.67:9300}{xpack.installed=true} committed version [6264]])
[2020-12-12T21:43:04,028][WARN ][o.e.c.NodeConnectionsService] [elasticsearch-elasticsearch-data-1b-4] failed to connect to node {elasticsearch-elasticsearch-data-1b-12}{rZx5hg0RQ36XIc1K0PhYDg}{IpRLByWFTSyU2hAWkPwwgg}{100.105.197.132}{100.105.197.132:9300}{xpack.installed=true} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elasticsearch-elasticsearch-data-1b-12][100.105.197.132:9300] connect_timeout[30s]
	at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163) ~[elasticsearch-6.4.1.jar:6.4.1]
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:643) ~[elasticsearch-6.4.1.jar:6.4.1]
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:542) ~[elasticsearch-6.4.1.jar:6.4.1]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:329) ~[elasticsearch-6.4.1.jar:6.4.1]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:316) ~[elasticsearch-6.4.1.jar:6.4.1]
	at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:153) [elasticsearch-6.4.1.jar:6.4.1]
	at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:180) [elasticsearch-6.4.1.jar:6.4.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.4.1.jar:6.4.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.4.1.jar:6.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:844) [?:?]
[2020-12-12T21:43:40,972][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][92490] overhead, spent [349ms] collecting in the last [1s]
[2020-12-12T21:43:45,060][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][92494] overhead, spent [353ms] collecting in the last [1s]
[2020-12-12T21:49:54,693][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-elasticsearch-data-1b-4] added {{elasticsearch-elasticsearch-data-1b-12}{rZx5hg0RQ36XIc1K0PhYDg}{lJBy6aOPTu66j6br-Y2pMA}{100.106.181.2}{100.106.181.2:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {elasticsearch-elasticsearch-master-2}{9Us1bbUnQWuak-hZ_Cq19w}{Zldki3ouSmGNcSpf0zUU_Q}{100.111.227.67}{100.111.227.67:9300}{xpack.installed=true} committed version [6284]])
[2020-12-12T21:59:58,904][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][93466] overhead, spent [301ms] collecting in the last [1s]
[2020-12-12T22:00:01,905][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][93469] overhead, spent [333ms] collecting in the last [1s]
[2020-12-12T22:06:45,539][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][93872] overhead, spent [334ms] collecting in the last [1s]
[2020-12-12T22:06:46,539][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][93873] overhead, spent [331ms] collecting in the last [1s]
[2020-12-12T22:06:48,610][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][93875] overhead, spent [309ms] collecting in the last [1s]
[2020-12-12T22:06:52,612][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][93879] overhead, spent [302ms] collecting in the last [1s]
[2020-12-12T22:09:57,714][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][94064] overhead, spent [333ms] collecting in the last [1s]
[2020-12-12T22:13:56,347][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][94302] overhead, spent [319ms] collecting in the last [1.1s]
[2020-12-12T22:13:58,348][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][94304] overhead, spent [288ms] collecting in the last [1s]
[2020-12-12T22:14:57,393][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][94363] overhead, spent [340ms] collecting in the last [1s]
[2020-12-12T22:15:00,510][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][94366] overhead, spent [325ms] collecting in the last [1s]
[2020-12-12T22:15:01,510][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][94367] overhead, spent [337ms] collecting in the last [1s]
[2020-12-12T22:27:56,060][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][95140] overhead, spent [320ms] collecting in the last [1s]
[2020-12-12T22:27:57,061][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][95141] overhead, spent [317ms] collecting in the last [1s]
[2020-12-12T22:41:39,556][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][95963] overhead, spent [316ms] collecting in the last [1s]
[2020-12-12T22:41:41,556][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][95965] overhead, spent [309ms] collecting in the last [1s]
[2020-12-12T22:41:42,557][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][95966] overhead, spent [311ms] collecting in the last [1s]
[2020-12-12T22:44:55,637][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][96159] overhead, spent [325ms] collecting in the last [1s]
[2020-12-12T22:47:46,693][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][96330] overhead, spent [318ms] collecting in the last [1s]
[2020-12-12T22:47:47,693][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][96331] overhead, spent [329ms] collecting in the last [1s]
[2020-12-12T22:58:04,072][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][96947] overhead, spent [325ms] collecting in the last [1s]
[2020-12-12T23:11:34,539][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][97757] overhead, spent [332ms] collecting in the last [1s]

How much heap does each node have?

What type of storage are you using? SSDs?

How many indices do you have in the cluster? How many of these are you actively indexing into?

How are you indexing into the cluster? What bulk size are you using? Are you indexing immutable data or also performing updates?

What type of data are you indexing? Are you using dynamic mappings?

Hey @beckerr and welcome!

In addition to Christian's comments, note that 6.4 is a long way past end-of-life which limits our ability to help. I don't think I have a development environment that works for 6.4 any more. You'll need to upgrade as soon as you can, at least to 6.8 (which is still maintained) and then ideally to 7.x to get the benefits of almost 2 years of improvements over the 6.x series.

A few comments on the handful of log messages you shared:

Both of these removed/added pairs indicate the node restarted, and came back with a different IP address. Is that expected? What happened there? Is that happening a lot? Elasticsearch can cope with nodes dropping out occasionally but recovery is nontrivial so you definitely don't want to see this happening very often. If it's unexpected then this could explain your problems - it's definitely worth addressing this first.

Also the cluster state versions in those messages are suspiciously small. How long ago did you restart the master nodes?

I think this is explained by the preceding removed {{elasticsearch-elasticsearch-data-1b-12} line: we lost a connection to that node so we tried to reconnect, and only then did we learn that it was gone from the cluster. This version has no mechanism to cancel the reconnection attempt so it fails, eventually, with a timeout and a log message. You can address this by upgrading to 7.x.

If your cluster stops adding/removing nodes and is still performing poorly then I'd suggest GET _nodes/hot_threads?threads=99999 when the cluster is under load, because this will give you a very detailed picture of what it's spending all its time doing. Share the output here if you need help understanding it.


Hi Christian,

How much heap does each node have?

Each node has ~31GB of heap, at least according to these values when I run _nodes/stats:

"heap_committed_in_bytes": 31448563712,
"heap_max_in_bytes": 31448563712,

What type of storage are you using? SSDs?

I am not exactly sure about the storage (I think it's SSD but I need to ask dev-ops), but I know the data nodes are larger memory-optimized (r-type) instances with 64GB of RAM.

How are you indexing into the cluster? What bulk size are you using? Are you indexing immutable data or also performing updates?

The current code indexes one document at a time (again, single requests, not bulk). There are at most 20 requests in flight at any given time (20 "processor" containers, each synchronously sending requests from its own message queue). All data is sent as doc_as_upsert because we frequently need to "re-ingest" or replay data and there is no way in the current processing to treat that as a separate "job".
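
For reference, each request looks roughly like this (a simplified sketch; the host, index, type and document values are placeholders, and in reality this goes through our client library):

    import requests

    ES = "http://localhost:9200"  # placeholder; we actually send to the client nodes

    def upsert(index, doc_type, doc_id, doc):
        # one _update call per document, with doc_as_upsert so replays simply overwrite
        resp = requests.post(
            "{}/{}/{}/{}/_update".format(ES, index, doc_type, doc_id),
            json={"doc": doc, "doc_as_upsert": True},
        )
        resp.raise_for_status()
        return resp.json()

    # each "processor" container does this synchronously, one message at a time
    upsert("some-department-2018-10-incidents", "_doc", "incident-123",
           {"incident_type": "fire", "units": 3})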

What type of data are you indexing? Are you using dynamic mappings?

We are indexing fire department data; each document ranges from tens of KB to a few hundred KB (rarely over 200KB). I believe we use dynamic mapping in our template...

"dynamic_templates": [{
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }]

However, all of our fields have explicit mappings defined; I believe this is just a catch-all in case we enrich our data or add new fields to a document without a mapping (defaulting to keyword).
Hope this answers everything!

I would recommend the following:

  • Make sure that you are using compressed pointers. The threshold is somewhere below 32GB but I am not sure exactly what it is.
  • Be aware that not using bulk requests is quite inefficient and will likely result in a lot of small writes and refreshes, which means a lot of disk I/O. If you are actively upserting documents across all indices you may however not benefit much from bulk requests in their current form, as few documents per bulk request would end up at the same shard anyway.
  • I would recommend considering merging all your indices into a single index with a suitable number of shards (depending on total data volume). Once you have done this you would also benefit from starting to use bulk requests (see the sketch after this list).
  • Check if you are using nested documents. Each nested document within the main document is indexed as a separate document behind the scenes and all subdocuments are reindexed for every update. Updates will therefore get slower and slower as documents grow.
  • I would also recommend looking at the hot threads API to see what the nodes are doing. You may also want to look at disk I/O, e.g. using iostat, to see if this might be a limiting factor.
  • Upgrade to a more recent version as David recommended.
  • As you only send single documents, only one shard will be doing work per concurrent client. You may also want to increase the number of client connections to make sure all nodes are kept busy.
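
To illustrate the bulk point, here is a minimal sketch of what a bulk upsert request could look like (hypothetical host, index, type and document names; a few hundred documents or a few MB per request is a common starting point for the batch size):

    import json
    import requests

    ES = "http://localhost:9200"  # placeholder

    def bulk_upsert(index, doc_type, docs):
        """docs: dict of doc_id -> document body."""
        # the bulk body is newline-delimited JSON: an action line, then the request body
        lines = []
        for doc_id, doc in docs.items():
            lines.append(json.dumps({"update": {"_index": index, "_type": doc_type, "_id": doc_id}}))
            lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))
        resp = requests.post(ES + "/_bulk", data="\n".join(lines) + "\n",
                             headers={"Content-Type": "application/x-ndjson"})
        resp.raise_for_status()
        result = resp.json()
        if result.get("errors"):
            # individual items can fail even when the request as a whole succeeds
            failures = [item for item in result["items"] if item["update"].get("error")]
            print("bulk failures:", failures)
        return result

    bulk_upsert("fire-incidents", "_doc",
                {"incident-1": {"units": 3}, "incident-2": {"units": 5}})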

Thank you for the explanation David,

Upgrading at this particular moment is something we are working on, but it will take some time for us to migrate over and we have to keep our current production cluster running until then (we are getting backed up in our ingest pipeline and this needs to be solved before I can comfortably focus on upgrading). We do run our nodes on spot instances, so I wonder if those nodes were being rescheduled? This doesn't happen too often when I look at our cluster (sometimes a single node or two after 5-12 hours), and either way I would expect to see spikes in performance if that were the culprit. Dev-ops had at one point also set our total_shard_count to 500, and this caused issues too (since not all shards could be reallocated at that limit). I increased that value yesterday around noon PST and the cluster went fully green afterwards and has remained so. I will try checking the hot threads like you suggested and get back to you.

Thank you Christian,

I am definitely going to be reindexing and changing our indexing strategy in the near future. Currently each fire department gets a new index with 2 shards and 1 replica each month, and as I mentioned, even the largest indices are maybe tens to hundreds of MBs. When I view department data by year, our largest customer has only 20GB for a single year. I plan on indexing/binning based on year and reducing the shard count to 1 for each index. I ran the hot_threads?threads=9999 call (as you and David suggested) and here is a copy of the output. I could definitely use some help from you or @DavidTurner in interpreting it though.

I think I confirmed we are already using compressed pointers; at least, /_nodes shows "using_compressed_ordinary_object_pointers": "true" on every single node.

One other question: do either of you think it is worth trying to scale out again, at least in the short term? I am just not sure whether it is worth doing or whether it would yield much better results.

The hot threads output you shared indicates that the node is virtually idle, so I do not think scaling out will help. If you have a fixed number of clients and requests in flight at any time and those requests start to take longer, throughput will suffer. I would recommend trying to increase the number of threads that index into the cluster, or start using bulk requests, and see if that makes any difference. Also look into how your data is structured and mapped. If you are using nested documents (a possibility, as you have quite large documents), throughput is likely to drop as documents get larger and more complex.

All nodes look pretty idle except, strangely, elasticsearch-elasticsearch-data-1b-12 which appears to have a number of threads busy writing to replicas. It is a bit suspicious that this one node is busy (but only a bit, I don't think it's worth following up right now).

So I agree with what Christian said, the bottleneck appears to be outside the cluster. Try pushing it harder!

I also note you're using the third-party ReadOnlyREST plugin. That isn't obviously a problem for you, but it's been observed to cause problems for others. This is another reason to upgrade: security is included in the default (cost-free) basic license from 6.8 onwards so there's no need to rely on third parties for this feature any more.


Understood. We have other application components that currently rely on this version/plugins just to keep our data flowing for users while we reindex into a new version, and in the meantime the current pipeline is also lagging behind (data is not current for the user). I pushed the processing like you requested and let each container (20 of them) run up to 10 requests simultaneously (at most 200 requests in flight at any moment). Performance dropped even further :confused:? I noticed there were many request timeouts (2 mins+). To answer @Christian_Dahlqvist, we have one nested type in the message in total. If the nodes are mainly idle I wonder what would be limiting requests so heavily... grasping at straws here, but is this potentially a client issue? I'm going to try to bulk some of this in the meantime and hope for the best.

Another thing that could slow down indexing like this is if, for most new documents, new fields are being added to the mapping, as this causes a cluster state update to be performed for each such request. This gets slower as the size of the cluster state grows (increasing shard counts and ever-growing mappings). It might be worth looking at the mappings for a few of the indices and checking whether there are any fields present that look dynamically generated, e.g. numbers or dates.
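
A rough way to check (a sketch with a placeholder host and index pattern, assuming the single-type-per-index layout of 6.x) is to pull the mappings for a sample of indices and look at the number and names of fields:

    import requests

    ES = "http://localhost:9200"      # placeholder
    PATTERN = "78827-*"               # hypothetical: look at a handful of indices, not all of them

    def field_names(properties, prefix=""):
        # walk the mapping recursively, collecting full field paths
        names = []
        for field, spec in properties.items():
            names.append(prefix + field)
            if "properties" in spec:
                names.extend(field_names(spec["properties"], prefix + field + "."))
        return names

    mappings = requests.get("{}/{}/_mapping".format(ES, PATTERN)).json()
    for index, data in sorted(mappings.items()):
        for doc_type, mapping in data["mappings"].items():
            fields = field_names(mapping.get("properties", {}))
            # a field count that keeps growing, or field names that look like values
            # (dates, numbers, IDs), points at unwanted dynamic mapping updates
            print(index, doc_type, len(fields))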

If you cannot find anything in the mappings I would recommend disabling the ReadOnlyREST plugin to see if that makes any difference. We do not have any experience with it, and it would be useful to eliminate it as a potential cause. If it turns out to make a difference I would recommend upgrading to Elasticsearch 6.8 in order to use the standard security features.

Just seen this, yikes, don't use spot instances for your master/data nodes, that's a sure way to lose data in the long run. Is your cluster health mostly green? How long does it take to get to green again after a node is rescheduled like that?

Quite possibly. The response to an indexing request includes a took field indicating how long Elasticsearch spent actually handling it. Can you log this value, along with the request duration experienced by the client? That'd tell us a bit more about whether the slowness is within Elasticsearch or within the client or somewhere between the two.
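
Something along these lines would capture both numbers (a sketch with placeholder names; adapt it to however your processors actually send requests):

    import time
    import requests

    ES = "http://localhost:9200"  # placeholder

    def timed_upsert(index, doc_type, doc_id, doc):
        start = time.monotonic()
        resp = requests.post(
            "{}/{}/{}/{}/_update".format(ES, index, doc_type, doc_id),
            json={"doc": doc, "doc_as_upsert": True},
        )
        client_ms = (time.monotonic() - start) * 1000
        resp.raise_for_status()
        took_ms = resp.json().get("took")  # time Elasticsearch reports spending on the request
        # a large gap between the two numbers points at the client, the network, or
        # queueing in front of Elasticsearch rather than Elasticsearch itself
        print("took={}ms client={:.0f}ms".format(took_ms, client_ms))
        return resp.json()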

This thread has a wealth of info. Thanks @Christian_Dahlqvist and @DavidTurner for such an enriching discussion.

@beckerr - Welcome to the community.

Would you mind sharing why you have 58 nodes when most of your data nodes have >90% free disk space? Is it that you plan to load data later and hence planned capacity accordingly?

Since you mentioned 64GB RAM, my guess is you are making use of the r5.2xlarge.elasticsearch instance type, as per this.

You also have 6 client nodes. Do you send the writes to client nodes or to data nodes?

Can you write a simple indexing request in curl, put it in a for loop or wrap it in a python/shell script, and measure how much time it takes to index, say, 100 records? (Something like the sketch at the end of this post.)

Also, is your client located in the same region as ES Service?
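
For the timing test, something like this would give a baseline docs/sec figure to compare against your pipeline (a rough single-threaded sketch; the host and throwaway index name are placeholders):

    import time
    import requests

    ES = "http://localhost:9200"   # placeholder: point this at one of your client nodes
    INDEX = "indexing-benchmark"   # hypothetical throwaway index

    start = time.monotonic()
    for i in range(100):
        resp = requests.put("{}/{}/_doc/{}".format(ES, INDEX, i),
                            json={"field": "value", "n": i})
        resp.raise_for_status()
    elapsed = time.monotonic() - start
    print("100 docs in {:.1f}s = {:.1f} docs/sec".format(elapsed, 100 / elapsed))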

Thanks for the follow-up @Christian_Dahlqvist and @DavidTurner. Here is an update with more info: First, I talked with dev-ops and we are currently not using SSD storage on our instances due to cost. To avoid data loss with spot instances we are using persistent volume claims. It's hard for me to catch exactly when a node is being rescheduled, but typically when it happens I see active_shards_percent_as_number drop to ~95% at most. It turns back to green in less than an hour (could be shorter; this is just based on previous casual observation).

I implemented bulk ingest (although not at the scale I wanted) and it has helped somewhat. Good to know this is a potential avenue to explore further though.

To answer Christian on the mappings: our template has most fields defined, although I would say there are approx. 10-25 that may be added dynamically at any given time. It seems like that should not cause this much of a slowdown though? I would be grateful for any pointers or docs on how to look into this further (especially the ever-growing mappings part).

How much data (total disk space) do you have stored in the cluster? If you have slow storage, having shards frequently recover will not help, as it causes additional disk I/O. The larger the cluster state is (which depends on the size of the mappings and the number of shards it needs to manage), the longer it will take to process, and the larger the cluster is, the longer it will take to propagate to the nodes.
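
One crude way to get a feel for the cluster state size (a sketch, placeholder host) is to fetch it over the REST API and see how large it is and how much of it is mappings:

    import requests

    ES = "http://localhost:9200"  # placeholder

    # the full cluster state; with thousands of indices this response itself can be large
    state = requests.get(ES + "/_cluster/state")
    state.raise_for_status()
    print("serialized cluster state: {:.1f} MB".format(len(state.content) / 1e6))

    # just the mappings portion, to see how much of the state the mappings account for
    mappings_only = requests.get(ES + "/_cluster/state/metadata",
                                 params={"filter_path": "metadata.indices.*.mappings"})
    print("mappings portion: {:.1f} MB".format(len(mappings_only.content) / 1e6))

The serialized size is only a proxy for what the master actually has to manage, but if it is large and dominated by mappings that supports the dynamic-fields theory.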

Thank you @sandeepkanabar,

Would you mind sharing why you have 58 nodes...

So the reason for the (actually) 57 nodes (48 of which are data nodes, plus 3 masters and 6 client nodes) is that we have 12241 primary shards (with 1 replica) and originally we had only 8 data nodes in total. This put us way above the 20-shards-per-GB-of-heap recommendation (each node has approx. 30GB heap = approx. 600 shards per node max), so we first attempted to scale way out to see if this was the culprit or would help.

Since you mentioned 64GB RAM...

I would have to verify the exact instance type with dev-ops, although I will note we are not using the managed AWS service. This is an internal cluster we manage with spot instances + kops/k8s.

You also have 6 client nodes. Do you send the writes to client nodes or to data nodes?

I believe all writes are sent to the client nodes only. I am not sure how to confirm this though; can you advise?

Also, is your client located in the same region as ES Service?

The client is located in the same region.

We have about 233 GB total. Here are my cluster stats for reference.

That is very, very little data given the number of shards and nodes. I would recommend that you prioritize bringing the shard count down significantly. 233GB would easily fit in a single index with a very reasonable number of primary shards, e.g. 8, 10 or 12 (roughly 20-30GB per shard). A minimal-size cluster with 3 master/data nodes should be able to handle this if the shard count were lower, and I do not understand why you would necessarily need coordinating-only nodes either.

With a cluster that size I would recommend running it on non-spot instances and I would expect your performance to be a lot better. It should also be cheaper than the current setup even if you added fast storage.

Add 3 new data nodes that are not spot instances. Use shard allocation filtering to make these a separate zone from the rest of the cluster. Create a new, single index that will hold all data going forward, make sure it has the correct mappings, and allocate it to this new zone. Also make sure it is created so that you can later use the split index API to expand it if required (a sketch of the settings involved is at the end of this post).

If you have times when no indexing/updating occurs (or when you can stop indexing), script reindexing of the old indices into the new one and then drop the old indices. You might even be able to add an alias with the old index name to the new consolidated index, although I am not sure how well the cluster will handle several thousand aliases. As the number of indices and shards in the cluster shrinks you can gradually decommission the old nodes. Once all the old indices are gone you can change your indexing solution to use a single index instead.
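
A sketch of the settings involved (hypothetical attribute name, index name and shard counts; the new non-spot data nodes would be started with something like node.attr.group: stable in their elasticsearch.yml, which the index-level allocation filter below then requires):

    import requests

    ES = "http://localhost:9200"  # placeholder

    # create the new consolidated index, pinned to the new "stable" nodes via shard
    # allocation filtering, with number_of_routing_shards set so the split index API
    # can expand it later (e.g. from 8 to 16 or 32 primaries) if the data grows
    body = {
        "settings": {
            "index.number_of_shards": 8,
            "index.number_of_replicas": 1,
            "index.number_of_routing_shards": 96,
            "index.routing.allocation.require.group": "stable",
        },
        "mappings": {
            "_doc": {
                # the full explicit mappings for the consolidated data go here
                "properties": {"incident_type": {"type": "keyword"}}
            }
        },
    }
    resp = requests.put(ES + "/fire-incidents-v2", json=body)
    resp.raise_for_status()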


Whoa! That's way too much considering you have just about 233 GB of data in all. In addition to Christian's answer, I would recommend setting up Curator (you can use one of your data nodes for this since you don't have the managed AWS service) and running it to shrink the number of shards, or else reindex and reduce the number of shards. One shard is good enough to hold 30-50 GB of data. More shards can help with indexing speed, provided you have more nodes to distribute the shards across.
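
For the shrink route, the underlying API call looks roughly like this (a sketch; hypothetical index name, one of your real node names picked arbitrarily, and note that a copy of every shard of the source index first needs to sit on that one node with the index write-blocked):

    import requests

    ES = "http://localhost:9200"  # placeholder

    SOURCE = "somedept-2018-10-incidents"   # hypothetical 2-shard monthly index
    TARGET = SOURCE + "-shrunk"

    # 1. relocate a copy of every shard onto a single node and block writes
    requests.put("{}/{}/_settings".format(ES, SOURCE), json={
        "settings": {
            "index.routing.allocation.require._name": "elasticsearch-elasticsearch-data-1b-4",
            "index.blocks.write": True,
        }
    })

    # 2. once relocation has finished, shrink the 2 primaries down to 1,
    #    clearing the allocation filter and write block on the target
    requests.post("{}/{}/_shrink/{}".format(ES, SOURCE, TARGET), json={
        "settings": {
            "index.number_of_shards": 1,
            "index.number_of_replicas": 1,
            "index.routing.allocation.require._name": None,
            "index.blocks.write": None,
        }
    })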

Also, the r4 instances are not optimised for I/O operations but rather for memory. Do you run heavy queries on this cluster?

And as Christian said, your limited dataset doesn't warrant separate client nodes at all. 48 data nodes is way too many, and you'll end up paying $$ for capacity you are not using at all, since 90% of the disk space is free on each of the 48 data nodes.

To be fair to the OP, they've already put a bunch of work into reducing their shard count from 24k to 12k, and have scaled out the cluster to (more than) accommodate their remaining excess of shards. There are definitely cheaper ways to run a 233GB cluster, but their focus at the moment is indexing performance and I don't really see how the size of the cluster is a contributing factor there. I know of 50-node clusters that index millions of docs per second; this one can't even make it to double figures.