Constant read timeout errors while indexing; non-master nodes idle; mismatched shard sizes

We are running a four-node Elasticsearch 9.1.2 cluster that behaves erratically under indexing load. When we first started indexing data, all four nodes (each a separate Kubernetes pod) were actively indexing. For the last hour or more, however, only the master node (the green line in our CPU monitoring) has shown significant CPU activity, while the non-master nodes sit essentially idle.

In the process performing the indexing, I see near-constant read timeout exceptions like this one:

Traceback (most recent call last):
  File "/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 468, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 463, in _make_request
    httplib_response = conn.getresponse()
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/http/client.py", line 1430, in getresponse
    response.begin()
  File "/usr/local/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/http/client.py", line 292, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/socket.py", line 720, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/ssl.py", line 1251, in recv_into
    return self.read(nbytes, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/ssl.py", line 1103, in read
    return self._sslobj.read(len, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/venv/lib/python3.12/site-packages/elastic_transport/_node/_http_urllib3.py", line 167, in perform_request
    response = self.pool.urlopen(
               ^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 802, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/urllib3/util/retry.py", line 527, in increment
    raise six.reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 716, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 470, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 358, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='flywheel-elasticsearch-es-http', port=9200): Read timed out. (read timeout=300)

We are using version 9.1.0 of the Python Elasticsearch client (elasticsearch-py). The indexing workload consists mostly of calls to helpers.streaming_bulk, each covering batches of up to 300 index actions.
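
Roughly, the indexing loop looks like the sketch below (index name, documents, and connection details are placeholders; the chunk size and the 300-second request timeout reflect what is described above):

from elasticsearch import Elasticsearch, helpers

# Simplified sketch of our indexing loop; auth/TLS options omitted.
# The 300-second request_timeout matches the "read timeout=300" in the error above.
es = Elasticsearch(
    "https://flywheel-elasticsearch-es-http:9200",
    request_timeout=300,
)

def generate_actions(docs):
    # One bulk "index" action per document.
    for doc in docs:
        yield {"_index": "data_explorer_search_20250916t170522z", "_source": doc}

docs = [{"field": f"value-{i}"} for i in range(10_000)]  # placeholder documents

for ok, item in helpers.streaming_bulk(
    es,
    generate_actions(docs),
    chunk_size=300,        # up to 300 actions per bulk request
    raise_on_error=False,  # report per-item failures instead of raising
):
    if not ok:
        print("bulk item failed:", item)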

Some other info about the cluster that might be helpful:

Cat nodes output:

ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
10.1.4.33            32          86  12    0.21    0.51     0.81 cdfhilmrstw -      flywheel-elasticsearch-es-default-1
10.1.20.71           56          90  37    1.64    1.54     1.55 cdfhilmrstw *      flywheel-elasticsearch-es-default-3
10.1.21.5            42          92  11    0.22    0.31     0.59 cdfhilmrstw -      flywheel-elasticsearch-es-default-0
10.1.2.109           60          97   6    0.32    0.29     0.25 cdfhilmrstw -      flywheel-elasticsearch-es-default-2

Cat shards output (why is one primary shard so much larger than the rest when they all hold roughly the same number of documents? See the segment check sketched just after this table):

index                                                         shard prirep state    docs  store dataset ip         node
.ds-.logs-elasticsearch.deprecation-default-2025.09.03-000001 0     r      STARTED     1 10.6kb  10.6kb 10.1.2.109 flywheel-elasticsearch-es-default-2
.ds-.logs-elasticsearch.deprecation-default-2025.09.03-000001 0     p      STARTED     1 10.6kb  10.6kb 10.1.4.33  flywheel-elasticsearch-es-default-1
.ds-ilm-history-7-2025.09.03-000001                           0     r      STARTED     3  9.8kb   9.8kb 10.1.2.109 flywheel-elasticsearch-es-default-2
.ds-ilm-history-7-2025.09.03-000001                           0     p      STARTED     3  9.8kb   9.8kb 10.1.21.5  flywheel-elasticsearch-es-default-0
.ds-ilm-history-7-2025.09.10-000002                           0     r      STARTED     0   249b    249b 10.1.20.71 flywheel-elasticsearch-es-default-3
.ds-ilm-history-7-2025.09.10-000002                           0     p      STARTED     0   249b    249b 10.1.21.5  flywheel-elasticsearch-es-default-0
data_explorer_fields                                          0     p      STARTED     0   249b    249b 10.1.21.5  flywheel-elasticsearch-es-default-0
.security-7                                                   0     p      STARTED    29   52kb    52kb 10.1.20.71 flywheel-elasticsearch-es-default-3
.security-7                                                   0     r      STARTED    29   52kb    52kb 10.1.4.33  flywheel-elasticsearch-es-default-1
data_explorer_search_20250916t170522z                         0     p      STARTED 30394  1.8gb   1.8gb 10.1.2.109 flywheel-elasticsearch-es-default-2
data_explorer_search_20250916t170522z                         1     p      STARTED 30701  1.6gb   1.6gb 10.1.21.5  flywheel-elasticsearch-es-default-0
data_explorer_search_20250916t170522z                         2     p      STARTED 30471  1.6gb   1.6gb 10.1.20.71 flywheel-elasticsearch-es-default-3
data_explorer_search_20250916t170522z                         3     p      STARTED 30963  1.6gb   1.6gb 10.1.4.33  flywheel-elasticsearch-es-default-1
data_explorer_search_20250916t170522z                         4     p      STARTED 30639  1.6gb   1.6gb 10.1.2.109 flywheel-elasticsearch-es-default-2
data_explorer_search_20250916t170522z                         5     p      STARTED 30728  1.6gb   1.6gb 10.1.21.5  flywheel-elasticsearch-es-default-0
data_explorer_search_20250916t170522z                         6     p      STARTED 30507 11.5gb  11.5gb 10.1.20.71 flywheel-elasticsearch-es-default-3
data_explorer_search_20250916t170522z                         7     p      STARTED 30715  1.6gb   1.6gb 10.1.4.33  flywheel-elasticsearch-es-default-1
data_explorer_search_20250916t170522z                         8     p      STARTED 30558  1.6gb   1.6gb 10.1.21.5  flywheel-elasticsearch-es-default-0
data_explorer_search_20250916t170522z                         9     p      STARTED 30418  1.6gb   1.6gb 10.1.2.109 flywheel-elasticsearch-es-default-2
data_explorer_search_20250916t170522z                         10    p      STARTED 30940  1.6gb   1.6gb 10.1.20.71 flywheel-elasticsearch-es-default-3
data_explorer_search_20250916t170522z                         11    p      STARTED 30420  1.6gb   1.6gb 10.1.4.33  flywheel-elasticsearch-es-default-1
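
To dig into that size difference, I plan to pull a per-segment breakdown for the index. A sketch using the Python client (the column list is just what I would look at first):

from elasticsearch import Elasticsearch

# Sketch only; auth/TLS options omitted.
es = Elasticsearch("https://flywheel-elasticsearch-es-http:9200", request_timeout=300)

# Per-segment view of the big index; shard 6 is the 11.5gb outlier above.
print(es.cat.segments(
    index="data_explorer_search_20250916t170522z",
    v=True,
    h="index,shard,prirep,segment,docs.count,docs.deleted,size,committed,searchable",
))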

Hot threads info:

::: {flywheel-elasticsearch-es-default-0}{mC6Ro8vATY2ZD9yrU0siRg}{o5B-wOyFSPWOMFREtEfQwQ}{flywheel-elasticsearch-es-default-0}{10.1.21.5}{10.1.21.5:9300}{cdfhilmrstw}{9.1.2}{8000099-9033000}{ml.allocated_processors_double=4.0, ml.machine_memory=12884901888, ml.config_version=12.0.0, ml.max_jvm_size=6442450944, ml.allocated_processors=4, xpack.installed=true, transform.config_version=10.0.0, k8s_node_name=gke-gcp-qa-cluster-gcp-qa-node-pool-7bd16219-n9sv}
   Hot threads at 2025-09-16T20:47:19.479Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {flywheel-elasticsearch-es-default-1}{-HSrS5F_T_OxB7QG_OSNUA}{m1sxiKTsStu2cb9Dk2TxaQ}{flywheel-elasticsearch-es-default-1}{10.1.4.33}{10.1.4.33:9300}{cdfhilmrstw}{9.1.2}{8000099-9033000}{ml.allocated_processors=4, ml.max_jvm_size=6442450944, ml.config_version=12.0.0, ml.machine_memory=12884901888, ml.allocated_processors_double=4.0, transform.config_version=10.0.0, xpack.installed=true, k8s_node_name=gke-gcp-qa-cluster-gcp-qa-node-pool-7bd16219-2yce}
   Hot threads at 2025-09-16T20:47:19.480Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   100.0% [cpu=0.0%, other=100.0%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[readiness-service]'
     10/10 snapshots sharing following 7 elements
       java.base@24/sun.nio.ch.Net.accept(Native Method)
       java.base@24/sun.nio.ch.ServerSocketChannelImpl.implAccept(ServerSocketChannelImpl.java:424)
       java.base@24/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:391)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.readiness.ReadinessService.lambda$startListener$0(ReadinessService.java:176)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.readiness.ReadinessService$$Lambda/0x00000000a74c20c0.run(Unknown Source)
       java.base@24/java.lang.Thread.runWith(Thread.java:1460)
       java.base@24/java.lang.Thread.run(Thread.java:1447)

::: {flywheel-elasticsearch-es-default-2}{HQkFKLjjQESUgxT6VubUWw}{6acuA91ZRP2WHA4j5D08IA}{flywheel-elasticsearch-es-default-2}{10.1.2.109}{10.1.2.109:9300}{cdfhilmrstw}{9.1.2}{8000099-9033000}{ml.allocated_processors=4, ml.max_jvm_size=6442450944, ml.config_version=12.0.0, ml.machine_memory=12884901888, ml.allocated_processors_double=4.0, transform.config_version=10.0.0, xpack.installed=true, k8s_node_name=gke-gcp-qa-cluster-gcp-qa-node-pool-7bd16219-icfi}
   Hot threads at 2025-09-16T20:47:19.481Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   100.0% [cpu=0.1%, other=99.9%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[readiness-service]'
     10/10 snapshots sharing following 7 elements
       java.base@24/sun.nio.ch.Net.accept(Native Method)
       java.base@24/sun.nio.ch.ServerSocketChannelImpl.implAccept(ServerSocketChannelImpl.java:424)
       java.base@24/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:391)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.readiness.ReadinessService.lambda$startListener$0(ReadinessService.java:176)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.readiness.ReadinessService$$Lambda/0x000000006f4be000.run(Unknown Source)
       java.base@24/java.lang.Thread.runWith(Thread.java:1460)
       java.base@24/java.lang.Thread.run(Thread.java:1447)

::: {flywheel-elasticsearch-es-default-3}{DkwriEhsSHa91NYAm75lUg}{ekY7bC8uT9aHHcUtj717dA}{flywheel-elasticsearch-es-default-3}{10.1.20.71}{10.1.20.71:9300}{cdfhilmrstw}{9.1.2}{8000099-9033000}{ml.allocated_processors=4, ml.max_jvm_size=6442450944, ml.config_version=12.0.0, ml.machine_memory=12884901888, ml.allocated_processors_double=4.0, transform.config_version=10.0.0, xpack.installed=true, k8s_node_name=gke-gcp-qa-cluster-gcp-qa-node-pool-7bd16219-2pw9}
   Hot threads at 2025-09-16T20:47:19.480Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   100.3% [cpu=100.3%, other=0.0%] (501.4ms out of 500ms) cpu usage by thread 'elasticsearch[flywheel-elasticsearch-es-default-3][flush][T#17]'
     10/10 snapshots sharing following 40 elements
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.util.BytesRefHash$1$1.buildHistogram(BytesRefHash.java:170)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.util.MSBRadixSorter.computeCommonPrefixLengthAndBuildHistogramPart2(MSBRadixSorter.java:267)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.util.MSBRadixSorter.computeCommonPrefixLengthAndBuildHistogramPart1(MSBRadixSorter.java:257)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.util.MSBRadixSorter.computeCommonPrefixLengthAndBuildHistogram(MSBRadixSorter.java:220)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.util.MSBRadixSorter.radixSort(MSBRadixSorter.java:157)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:136)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.util.MSBRadixSorter.sort(MSBRadixSorter.java:129)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.util.StringSorter.sort(StringSorter.java:54)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.util.BytesRefHash.sort(BytesRefHash.java:216)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.TermsHashPerField.sortTerms(TermsHashPerField.java:114)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:97)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.IndexingChain.flush(IndexingChain.java:342)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:456)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:502)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.DocumentsWriter.maybeFlush(DocumentsWriter.java:456)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:649)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:579)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:382)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:356)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:346)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:49)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.index.engine.ElasticsearchReaderManager.refreshIfNeeded(ElasticsearchReaderManager.java:28)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
       app/org.apache.lucene.core@10.2.2/org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:2072)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.index.engine.InternalEngine.flushHoldingLock(InternalEngine.java:2277)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.index.engine.Engine.flush(Engine.java:1423)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.index.shard.IndexShard.lambda$flush$12(IndexShard.java:1536)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.index.shard.IndexShard$$Lambda/0x000000003d542940.accept(Unknown Source)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.action.ActionListener.run(ActionListener.java:465)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1528)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.index.shard.IndexShard.lambda$afterWriteOperation$51(IndexShard.java:4151)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.index.shard.IndexShard$$Lambda/0x000000003d5819d8.run(Unknown Source)
       app/org.elasticsearch.server@9.1.2/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:977)
       java.base@24/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1095)
       java.base@24/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:619)
       java.base@24/java.lang.Thread.runWith(Thread.java:1460)
       java.base@24/java.lang.Thread.run(Thread.java:1447)
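
In case it is useful, I can also gather per-node stats for the write and flush thread pools; a sketch of how I would pull them with the Python client:

from elasticsearch import Elasticsearch

# Sketch only; auth/TLS options omitted.
es = Elasticsearch("https://flywheel-elasticsearch-es-http:9200", request_timeout=300)

# Active/queued/rejected counts for the write and flush pools on every node.
print(es.cat.thread_pool(
    thread_pool_patterns="write,flush",
    v=True,
    h="node_name,name,active,queue,rejected,completed",
))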

I have been seeing this issue for a long time and had hoped that upgrading to ES 9 would alleviate it, but that does not seem to be the case. Please let me know if I should provide any other info to help diagnose this. Thanks in advance!