Slow indexing speed on just one index

I looked around the internet and the forums but can't find anything similar...

Background

I have a two-data-node cluster (plus a tie-breaker) with about 10 indexes that all hold the same kind of documents (same properties).
Nine of the indexes are each about 50-100GB with 10-100M docs. I send batched requests to ingest docs (always index, never update) and get speeds of around 15k docs/s.
The tenth index is 500GB with 700M docs. We send data using the exact same process and get speeds of 1-3k docs/s.

The two data nodes run on separate hardware backed by NVMe drives, each with 50GB of available RAM, connected via 10GbE.

I've looked through the settings to compare and see what the differences might be. What I found is:
fast:

settings.index.number_of_shards = "2"

slow:

settings.index.number_of_shards = "4"
settings.index.refresh_interval = "600s"

Is it the fact that there are twice the number of shards, or that there is a 10-minute refresh interval... or something else I don't know about yet?

The lack of the refresh_interval setting on the fast indexes makes me think that might have something to do with it... but before I start making changes in my production cluster I want to have a better understanding of what it does.

I googled it and came to the docs on updating index settings, which tell me the setting can be updated dynamically, that setting it to null removes it (restoring the default), but that for bulk indexing I should set it to "-1"?!
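If I'm reading the docs right, the change would be a dynamic settings update, something like this (the index name here is made up):

PUT /my_slow_index/_settings
{
  "index": {
    "refresh_interval": "-1"
  }
}

PUT /my_slow_index/_settings
{
  "index": {
    "refresh_interval": null
  }
}

with -1 disabling refresh entirely and null putting the default back... but I'd like to confirm that before touching anything.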

From a Stack Overflow thread it seems that maybe it has to do with how quickly the ingested data becomes available as search results? Or maybe how frequently it's flushed to disk?

It would make some sense if the slowness I'm seeing came from the translog being read more frequently, or from writing to disk more often, but 600s is 10 minutes, and this is not a sawtooth performance issue; it is a steady 3k docs/s all day long.

Can anyone help me understand:

  1. is this a smoking gun for my performance issue?
  2. if not, any guesses as to where to look next?
  3. what is settings.index.refresh_interval meant to do when changed?
  4. is the default for settings.index.refresh_interval its absence (I presume null) or "-1"?
  5. is it safe to change while the bulk indexing requests are streaming in, or should I stop the bulk requests first?

Thank you for any help.

How many primary shards do the different indices have?

Which version of Elasticsearch are you using?

Is there any difference in document size and/or complexity between the different indices?

Are you specifying document IDs externally or allowing Elasticsearch to assign them?

Do you perform updates and/or deletes?

Are you using any more advanced features that could impact performance, e.g. parent-child, nested documents, ingest pipelines or vectors? If so, does this differ across the indices?

How are you measuring the indexing rate? Also, is the 15k docs/s per index across the nine fast indices, or overall?

It is not clear how you are measuring this.

The default value for refresh_interval is 1s, so if you do not change it, the index will use this value.
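If you want to double-check what an index is actually using, something like this should show the effective value, defaults included (substitute your index name):

GET /my-index/_settings/index.refresh_interval?include_defaults=true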

Also, how many shards do you have? Can you run GET _cat/shards?v and share the results?

Thank you for helping!

I forgot to mention in the original post that this is a manually generated clone of the Elastic Cloud deployment we have. I took the settings from our Elastic Cloud v7.15 cluster and basically just copied and pasted. Same hardware profile, same number of cluster members, etc. And the index settings are all the same... not sure if that helps or hurts, but the Elastic Cloud version didn't have this delta in indexing speed among the similar indexes.

How many primary shards do the different indices have?

Using Kibana it says "Total shards 216", but that includes all the .something and other minimally used indexes.
The indexes I most care about are the 11 (I think I said 10 before, but it's actually 11) that hold basically all the business data.

Which version of Elasticsearch are you using?

{
  "name" : "elastic2",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "xxxxxxxxxxxxxx",
  "version" : {
    "number" : "8.15.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "1a77947f34deddb41af25e6f0ddb8e830159c179",
    "build_date" : "2024-08-05T10:05:34.233336849Z",
    "build_snapshot" : false,
    "lucene_version" : "9.11.1",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

Is there any difference in document size and/or complexity between the different indices?

That was the first thing I checked; each individual document is about the same size. If you look at the JSON that comes back it's all the same properties, and each property is roughly the same size with some minor variation in some "description" type properties. Basically all the properties are either strings or ints, and some are arrays of ints, but the arrays are small (like 12 items).

Are you specifying document IDs externally or allowing Elasticsearch to assign them?

Externally. The IDs are precomputed hashes of some of the properties (ex: aQtOVNq8ZF7T98Su+7mwY/ddgzo)

Do you perform updates and/or deletes?

We currently do not update or delete; we only index (full document overwrite), but will likely move to updates eventually.

Are you using any more advanced features that could impact performance, e.g. parent-child, nested documents, ingest pipelines or vectors? If so, does this differ across the indices?

None on these indexes. We do have some pipelines on some other, as-yet-unused indexes for geoip stuff.

Can you show the output of _cat/indices for the 11 indices in question? It would be good to understand the average shard size per index.

If you are specifying external IDs, each indexing operation has to be treated as a potential update, so Elasticsearch has to check whether the document exists before it can index it. This adds a read and thereby overhead. I have previously seen this make indexing throughput go down as shard sizes grow (which is why I asked about this).
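For illustration, the difference in a bulk request is just whether the action line carries an _id. A rough sketch (index and field names are examples):

POST /_bulk
{ "index": { "_index": "v9_idx_k", "_id": "aQtOVNq8ZF7T98Su+7mwY/ddgzo" } }
{ "some_field": "value", "some_count": 42 }
{ "index": { "_index": "v9_idx_k" } }
{ "some_field": "value", "some_count": 43 }

With the first form Elasticsearch has to look the ID up in the shard before indexing; with the second it generates the ID itself and can skip that check.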

Thank you for your help!


How are you measuring the indexing rate? Also, is the 15k docs/s per index across the nine fast indices, or overall?

I'm just looking at the rates in Kibana (ex: https://kibana.local/app/monitoring#/elasticsearch/indices), and the column "Index Rate" says a number like 2,556.56 /s; I thought I read somewhere in the documentation that it was documents per second. Kibana also has some other graphs and stuff I've glanced at, but not too closely.
Additionally, I look at the number of seconds it takes for the batcher I wrote to return from the bulk call. I consistently send it 50k docs worth of data (so 100k NDJSON lines) and the fast indexes return in 6-8sec; the slow one takes 30-60sec to return.

Also, how many shards do you have? Can you run GET _cat/shards?v and share the results?

Total number of all shards is 216. However, most of those are Kibana or other built-ins. These indexes account for 48 shards. The sanitized results of the command are:

index                                                         shard prirep state        docs   store dataset ip          node
v9_idx_b                                                      0     p      STARTED  45053176    13gb    13gb 10.10.10.68 elastic2
v9_idx_b                                                      0     r      STARTED  45053176    13gb    13gb 10.10.10.69 elastic1
v9_idx_b                                                      1     p      STARTED  45041236  13.3gb  13.3gb 10.10.10.68 elastic2
v9_idx_b                                                      1     r      STARTED  45041236  13.3gb  13.3gb 10.10.10.69 elastic1
v9_idx_d                                                      0     p      STARTED  37964518   9.7gb   9.7gb 10.10.10.68 elastic2
v9_idx_d                                                      0     r      STARTED  37964518   9.7gb   9.7gb 10.10.10.69 elastic1
v9_idx_d                                                      1     p      STARTED  37968037  12.3gb  12.3gb 10.10.10.68 elastic2
v9_idx_d                                                      1     r      STARTED  37968037  12.3gb  12.3gb 10.10.10.69 elastic1
v9_idx_e                                                      0     p      STARTED  33196640   9.6gb   9.6gb 10.10.10.68 elastic2
v9_idx_e                                                      0     r      STARTED  33196640   9.6gb   9.6gb 10.10.10.69 elastic1
v9_idx_e                                                      1     p      STARTED  33205424   8.8gb   8.8gb 10.10.10.68 elastic2
v9_idx_e                                                      1     r      STARTED  33205424   8.8gb   8.8gb 10.10.10.69 elastic1
v9_idx_a                                                      0     p      STARTED  12475961   3.3gb   3.3gb 10.10.10.68 elastic2
v9_idx_a                                                      0     r      STARTED  12475961   3.3gb   3.3gb 10.10.10.69 elastic1
v9_idx_a                                                      1     p      STARTED  12476921   2.7gb   2.7gb 10.10.10.68 elastic2
v9_idx_a                                                      1     r      STARTED  12476921   2.7gb   2.7gb 10.10.10.69 elastic1
v9_idx_c                                                      0     p      STARTED  72839795  22.7gb  22.7gb 10.10.10.68 elastic2
v9_idx_c                                                      0     r      STARTED  72839795  22.7gb  22.7gb 10.10.10.69 elastic1
v9_idx_c                                                      1     p      STARTED  72840260  21.4gb  21.4gb 10.10.10.68 elastic2
v9_idx_c                                                      1     r      STARTED  72840260  21.4gb  21.4gb 10.10.10.69 elastic1
v9_idx_g                                                      0     p      STARTED  45927791  13.4gb  13.4gb 10.10.10.68 elastic2
v9_idx_g                                                      0     r      STARTED  45927791  13.4gb  13.4gb 10.10.10.69 elastic1
v9_idx_g                                                      1     p      STARTED  45940109    13gb    13gb 10.10.10.68 elastic2
v9_idx_g                                                      1     r      STARTED  45940109    13gb    13gb 10.10.10.69 elastic1
v9_idx_k                                                      0     p      STARTED 173813275  55.1gb  55.1gb 10.10.10.68 elastic2
v9_idx_k                                                      0     r      STARTED 173813275  56.4gb  56.4gb 10.10.10.69 elastic1
v9_idx_k                                                      1     p      STARTED 173830399  54.9gb  54.9gb 10.10.10.68 elastic2
v9_idx_k                                                      1     r      STARTED 173830399  55.1gb  55.1gb 10.10.10.69 elastic1
v9_idx_k                                                      2     p      STARTED 173803360    55gb    55gb 10.10.10.68 elastic2
v9_idx_k                                                      2     r      STARTED 173803360  54.9gb  54.9gb 10.10.10.69 elastic1
v9_idx_k                                                      3     p      STARTED 173826727  56.5gb  56.5gb 10.10.10.68 elastic2
v9_idx_k                                                      3     r      STARTED 173826727    55gb    55gb 10.10.10.69 elastic1
v9_idx_j                                                      0     p      STARTED  67740660  21.4gb  21.4gb 10.10.10.68 elastic2
v9_idx_j                                                      0     r      STARTED  67740660  21.4gb  21.4gb 10.10.10.69 elastic1
v9_idx_j                                                      1     p      STARTED  67756690  21.3gb  21.3gb 10.10.10.68 elastic2
v9_idx_j                                                      1     r      STARTED  67756690  21.3gb  21.3gb 10.10.10.69 elastic1
v9_idx_h                                                      0     p      STARTED  51697542    15gb    15gb 10.10.10.68 elastic2
v9_idx_h                                                      0     r      STARTED  51697542    15gb    15gb 10.10.10.69 elastic1
v9_idx_h                                                      1     p      STARTED  51711149  16.2gb  16.2gb 10.10.10.68 elastic2
v9_idx_h                                                      1     r      STARTED  51711149  16.2gb  16.2gb 10.10.10.69 elastic1
v9_idx_i                                                      0     p      STARTED  18725000   4.6gb   4.6gb 10.10.10.68 elastic2
v9_idx_i                                                      0     r      STARTED  18725000   4.6gb   4.6gb 10.10.10.69 elastic1
v9_idx_i                                                      1     p      STARTED  18722978   5.2gb   5.2gb 10.10.10.68 elastic2
v9_idx_i                                                      1     r      STARTED  18722978   5.2gb   5.2gb 10.10.10.69 elastic1
v9_idx_f                                                      0     p      STARTED  61429500  19.8gb  19.8gb 10.10.10.68 elastic2
v9_idx_f                                                      0     r      STARTED  61429500  19.8gb  19.8gb 10.10.10.69 elastic1
v9_idx_f                                                      1     p      STARTED  61435002  19.1gb  19.1gb 10.10.10.68 elastic2
v9_idx_f                                                      1     r      STARTED  61435002  19.1gb  19.1gb 10.10.10.69 elastic1

Note: I said 10 total indexes before, but it's really 11; 10 are fast (v9_idx_a through v9_idx_j) and the 11th, v9_idx_k, is slow.

If you need all the other stuff I can provide it; it just seemed like a waste. Again, we are basically only using these indexes, no other logging or anything.

The slow index has significantly larger shards than the others so I suspect the use of external IDs could indeed be the reason. If this is expected to be larger and hold more data than the others you may want to split it into a larger number of primary shards and see if that makes a difference.

...the output of _cat/indices for the 11 indices in question?

health status index    uuid               pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green  open   v9_idx_a OaVhuS94TtlEs1Nw   2   1   24952882      2503366       12gb            6gb          6gb
green  open   v9_idx_c YBAV3nxGTr82B08Q   2   1  145680055     28004239     88.5gb         44.2gb       44.2gb
green  open   v9_idx_b rvCsyy55S4DVX52g   2   1   90094412     17771232     52.8gb         26.4gb       26.4gb
green  open   v9_idx_k EHJ5qr98QX0UbkKw   4   1  695273772    160659107    444.4gb        222.3gb      222.3gb
green  open   v9_idx_d Z8kuD7HQT7ZK04SQ   2   1   75932555     14662209     44.3gb         22.1gb       22.1gb
green  open   v9_idx_g xxwtQy7pQ7CpKDeg   2   1   91867900     20004465     52.9gb         26.4gb       26.4gb
green  open   v9_idx_i C52JTU0ZQ8fzsvhA   2   1   37447978      4509242     19.8gb          9.9gb        9.9gb
green  open   v9_idx_e 2jitPuMjQxdDyPIw   2   1   66402064     11559902     37.1gb         18.5gb       18.5gb
green  open   v9_idx_f g0UrNxzdT0YwIlgQ   2   1  122864502     23323189     77.9gb         38.9gb       38.9gb
green  open   v9_idx_h QDhvXL1QTMETT4ug   2   1  103408691     11705503     62.6gb         31.3gb       31.3gb
green  open   v9_idx_j Wk251dLDSWCsmE8g   2   1  135497350     32380220     85.6gb         42.8gb       42.8gb

Do I need to care about sanitizing the uuids? I truncated them in the above output; I'm just not sure how much of this kind of stuff is dangerous to publish.

If you are specifying external IDs each indexing operation has to be treated as a potential update, so Elasticsearch has to check if the document exists before it can index it. This adds a read and thereby overhead.

I see the concern. The interesting part is the contrast between this 8.15 cluster and our Elastic Cloud 7.15 cluster, which has the exact same data in it... The indexing times are consistent across indexes there, and I know cloud vs on-prem isn't really apples to apples, but it at least might give an understanding of the index speed over time. The doc count in cloud is 2.5B and on-prem is 1.6B.

Also, if we do switch to an update model, it seems like we will basically be "paying" the same overhead as just an overwrite. So I guess I'll keep that in mind.

Not dangerous at all.

Indexing with an external ID can result in significantly higher disk I/O, so if you have slower storage in your on-prem cluster that could explain the difference.

The increased size is for sure expected.

I came across some recommendations to reduce the redundancy to 0 (reducing the number of shard copies) during indexing, then put it back to 1 (or whatever) when done. That seemed kinda risky, but the previous tech who was running the Elastic Cloud cluster did reduce it... and never made it redundant again for the one big index. So maybe that could be why we don't see this on the cloud cluster.
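For reference, I believe that change is just a dynamic setting you flip and later flip back (index name made up):

PUT /my_big_index/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}

and then set it back to 1 once the bulk load is done.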

you may want to split it into a larger number of primary shards and see if that makes a difference.

I'll have to read up on how to do this. But that leads me to some questions:

  1. will this likely alleviate the issue somewhat or altogether?
  2. can shard splitting be reversed? if yes, would I want to?
  3. what disadvantages, if any, come with splitting the shards?
  4. based on the other index sizes (the next largest is half the size), can/should we have some policy to split at a size like 25gb?

If there is a resource that holds this kind of info please feel free to just point me there instead of having to answer yourself <3

Thank you again.

I don't think there is a difference in IOPS between the cloud instances and our on-prem hardware. When I specced out the hosts I took the cloud instance specs as the guide; they advertised speeds similar to PCIe 3.0 NVMe drives, so I got PCIe 4.0 drives to future-proof ourselves.

But I will keep this in mind moving forward. Thank you.

If the shard size is the issue it should help.

Yes (see shrink index API), but I do not see why you would do that, assuming you do not increase shard count excessively.

I do not see any disadvantages if you increase the primary shard count by a factor of 2 or 4, as your shard size would still be reasonably large.

If you can estimate the data volumes ahead of time it is better to right-size the shards at the start instead of splitting (creates a new index and requires index to be read-only).
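As a sketch, right-sizing up front is just choosing the primary shard count when the index is created; the index name and counts here are only examples:

PUT /v9_idx_k_v2
{
  "settings": {
    "index": {
      "number_of_shards": 8,
      "number_of_replicas": 1
    }
  }
}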

If you have local SSDs you should be fine. I have, however, seen many users hooked up to networked SSD-based storage claim they have SSDs, only to be limited by network latency/throughput.

I will go down the shard splitting path after I educate myself a bit more on it... and

If you can estimate the data volumes ahead of time it is better to right-size the shards at the start instead of splitting (creates a new index and requires index to be read-only).

I'm not sure this is possible for this data, but I'll think harder on it. Since what I'm indexing now represents about a year of business data, this could be a good representation. I'm not sure if we want to keep more than 1yr worth; if we want to stay around this length of retention, then I assume we can use some lifecycle policies or something to delete or age out old data?

If you have local SSDs you should be fine.

Yup, an onboard NVMe M.2 SSD. It's not redundant; I'm relying on the Elastic clustering for that. We are making snapshots and all, so I think it's all super good enough. And keeping up with the IOPS is, I think, worth a little risk if HA is in place.
Should we consider a 3rd data node?
Oh, if we did add an additional data node as is (no shard splitting), would we just be making our performance problem worse because now we would have an additional copy of all the shards?

Thank you again, I won't ask any more shard questions until I read up on it more. Thank you!

ILM works by deleting complete indices, so it is not able to delete selected data from within an index. To use ILM you would need to change your indexing strategy and use time-based indices or rollover, which would complicate updates and deletes. To delete data from within an index you need to set up and schedule delete-by-query requests externally.
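A scheduled delete-by-query would look roughly like this, assuming you have a date field you can filter on (the field name here is hypothetical):

POST /v9_idx_k/_delete_by_query
{
  "query": {
    "range": {
      "ingest_date": {
        "lt": "now-1y"
      }
    }
  }
}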

Your big index in the cloud cluster does not have replicas?

Are you running on Linux? What is the file system for the data disk of Elasticsearch? ext4 or xfs?

Your big index in the cloud cluster does not have replicas?

Correct, no replicas on the big index.
It might be acceptable to trade redundancy for performance and then insulate from the risk with snapshots and things, but my SysAdmin roots tell me I should allow for replication.
The data is updated once a month (a 2-3 day firehose of data in the cloud) and then basically just gets searched.
I wrote my own transformer in Java which just takes a CSV/TSV format and builds batch requests to the endpoint. I send the requests to a single node instead of round-robining between them, so maybe there are ways to improve, but given that only one of the 11 indexes is much different in size and performance, I kinda wanna solve this issue first before trying something drastically different.
Unless you think there would be a drastic benefit to round-robining the bulk requests across both data nodes, reworking the transformer, etc.

Are you running on Linux? What is the file system for the data disk of Elasticsearch? ext4 or xfs?

Overview:
Two physical hosts, each an AMD Ryzen 5 (12 threads) with 128GB RAM (4x32GB), running Ubuntu 24.04 Server and standalone LXD virtualization (no LXD clustering), installed to a RAID1 pair of SATA SSDs formatted ext4.
The Elastic nodes are Ubuntu 24.04 LXC containers running in an LXD data pool backed by a single PCIe 4.0 M.2 NVMe SSD formatted with BTRFS.

From the nodes' OS they see:

#elastic1 on host srv1
/dev/nvme0n1p3 on / type btrfs (rw,relatime,idmapped,ssd,discard=async,space_cache=v2,user_subvol_rm_allowed,subvolid=325,subvol=/containers/elastic1)
#elastic-tiebreaker on host srv1
/dev/nvme0n1p3 on / type btrfs (rw,relatime,idmapped,ssd,discard=async,space_cache=v2,user_subvol_rm_allowed,subvolid=330,subvol=/containers/elastic-tiebreaker)
#elastic2 on host srv2
/dev/disk/by-uuid/84458d68-1fcc-43bb-ab63-fcb836f383d0 on / type btrfs (rw,relatime,idmapped,ssd,discard=async,space_cache=v2,subvolid=276,subvol=/containers/elastic2)

I'll admit I'm not a BTRFS master, but it seems to be performing admirably in IOPS testing.

We started with LXD to allow for easier configuration management (Terraform and Ansible) and because we had a single host as a POC box.
Side note: That's right, all three Elastic nodes were on the same hardware, which actually worked great until we added some additional services (MySQL hosts and some Java runners for our business code). We finally started seeing NVMe drive controller heat issues causing throttling and crashing when the box was indexing AND running additional loads.

We bought a 2nd physical host once the search speeds proved acceptable, and now the Elastic cluster is spread across the two similarly configured physical hosts with none of the other services running, only Elasticsearch and Kibana.

That was more than you asked for, but maybe it will help you better understand the situation <3

Thank you again for your time and help.

Yeah, I do not use btrfs, but sometimes IOPS testing does not reflect real usage.

The main reason I asked is that some mount options can improve performance, mainly changing from relatime to noatime.

I've improved the performance of some clusters just by changing this mount option. They were ext4 file systems, but from what I've read the same thing applies to btrfs.

Using noatime is also better for SSDs as there will be fewer writes.

I would change the mount from relatime to noatime, but I recommend you read up on it first.
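On a typical Linux host the change would be editing the mount options in /etc/fstab and/or remounting, e.g. (the path is just an example):

sudo mount -o remount,noatime /path/to/elasticsearch/data

but as mentioned, read up on it for your setup first, especially with LXD and btrfs in the mix.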


some mount options can improve performance, mainly changing from relatime to noatime.

Oh man, thank you for the good catch. I totally forgot.

I have managed to remount most of the filesystems with noatime but trying to get the rootfs mount options changed in LXD containers seems to be harder than I thought. So this test will have to wait. In the process I've kinda managed to bork some things a bit. So I'm gonna circle back on the file system mounts some other time.

BTW, I was able to set an LXC setting for the pool

ivan@srv2:~$ lxc storage show default
name: default
description: ""
driver: btrfs
status: Created
config:
  btrfs.mount_options: noatime,ssd
  source: 84458d68-1fcc-43bb-ab63-fcb836f383d0
  volatile.initial_source: /dev/nvme0n1
used_by:
- /1.0/instances/elastic2

But when I log in to the container and check, it's:

root@elastic2:~# mount | grep btrfs
/dev/disk/by-uuid/84458d68-1fcc-43bb-ab63-fcb836f383d0 on / type btrfs (rw,relatime,idmapped,ssd,discard=async,space_cache=v2,subvolid=276,subvol=/containers/elastic2)
/fake_hdd.img on /hdd type btrfs (rw,noatime,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)
/dev/md0p2 on /ssd type btrfs (rw,noatime,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)

The host machine has the drive mounted noatime:

/dev/nvme0n1 on /data type btrfs (rw,noatime,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)

But that is done via /etc/fstab and my Ubuntu 24.04 minimal containers don't seem to respect what I put in /etc/fstab.

After doing as much reading as I can, it seems splitting a shard is meant to benefit search performance. One article anecdotally cautioned that indexing performance might suffer after increasing the number of primary shards. It was unclear if they were simply cautioning the reader not to overdo it.

First of all, is "splitting a shard" done with the Split an Index command?

Splitting is basically creating a new index with a new number of primary shards and hard-linking the files internally for faster splitting. This is ideal because recreating this dataset from source data has taken over a month of ingesting so far. I'm aware splitting must be done on a read-only index, which is totally doable at this point.
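From what I've read so far, I think the sequence would be roughly this; the target name and shard count are just my guesses:

PUT /v9_idx_k/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

POST /v9_idx_k/_split/v9_idx_k_split
{
  "settings": {
    "index.number_of_shards": 8
  }
}

i.e. block writes on the source, then split into a multiple of the current 4 primaries.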

Secondly, rolling back or undoing the split seems to be the Shrink an Index command... I will only move forward if I have a solid understanding of the path ahead and how to walk it back.

Additionally, I have offsite snapshots; can the entire index be restored from a snapshot after a split?

If yes, can I simply Restore a Snapshot, putting it back to the way it is now?

If I have enough local disk space, should I first add a local filesystem repository and snapshot to that to reduce the restore time?

How long should I expect a restore of a 500GB index to take from local SATA SSD to NVMe SSD? Should I expect the same speed as the original indexing, or is it more like file-copy speed? Recall it's taken a month to index the data (still not done), which is 1TB total (500GB of real data configured with one replica shard), but if I cp data between drives I can easily saturate the bus speed.
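To clarify what I mean by a local filesystem repository, I'm picturing something like this (repo name, snapshot name, and path are made up, and I know the location has to be listed in path.repo first):

PUT /_snapshot/local_fs_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/local_snapshots"
  }
}

POST /_snapshot/local_fs_repo/pre_split_snap/_restore
{
  "indices": "v9_idx_k"
}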

Thank you for your patience and ongoing help.