Index creation slows down over time

We have a Ruby test suite that creates and deletes our indices between tests. Under Elasticsearch 5.6 this runs consistently fast. Under 6.8 it degrades over time, and under 7.6 it is slow right off the bat. I made a generic benchmark and ran it against the three versions listed above:

require 'benchmark/ips'
require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

index_configs = (1..20).map do |i|
  alias_name = "test_index_#{i}"
  name = "test_#{i}_index_#{i}"
  config = {
    aliases: { alias_name => {} },
    settings: {
      number_of_shards: 1,
      number_of_replicas: 0
    }
  }

  {name: name, body: config}
end

def delete_my_indices!
  client = Elasticsearch::Client.new(url: 'http://localhost:9200')
  all_indices = client.indices.get_alias().keys
  my_indices = all_indices.grep(/test_/)
  client.indices.delete(index: my_indices) if my_indices.any?
end


Benchmark.ips do |x|
  # Configure the number of seconds used during
  # the warmup phase (default 2) and calculation phase (default 5)
  x.config(:time => 15, :warmup => 5)

  x.report("create 20 indices") do
    delete_my_indices!
    index_configs.each do |conf|
      client.indices.create(index: conf[:name], body: conf[:body])
    end
  end

  # Compare the iterations per second of the various reports!
  x.compare!
end
delete_my_indices!

Here are the best results for each version (I'm using a MacBook Pro with a 2.6 GHz 6-core Intel Core i7 and 32 GB of 2667 MHz DDR4 RAM).

The best run for ES 6 was:
create 20 indices 1.653 (± 0.0%) i/s - 25.000 in 15.229568s
The best run for ES 7 was:
create 20 indices 0.200 (± 0.0%) i/s - 3.000 in 15.014049s
And for ES 5:
create 20 indices 2.921 (± 0.0%) i/s - 44.000 in 15.092491s

I'm using the defaults for everything except that I set -Xms4g -Xmx4g. I also set -XX:NewRatio=2 for ES 5 and ES 6 (it made no difference for ES 7).

I'm not very good at reading the node stats to see where things are dragging out in 6.x and 7.x. I'd be much obliged if anyone has any suggestions as to what I should try next.

Hello @astevenson

This is interesting, and I have a few questions:

  • Are you using the same Elasticsearch client for all the 3 Elasticsearch nodes/clusters?
  • Are the 3 Elasticsearch nodes pointing to the same disk for storing the data (path.data)?
  • Are the 3 Elasticsearch nodes using the jvm.options that come bundled with them (aside from the JVM heap size)?
  • Would it be possible to make delete_my_indices! reuse the client you've already opened at the beginning?
  • Would it be possible to run the benchmark multiple times (to get the variations)?
  • Which HTTP library are you using?
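
A minimal sketch of the client-reuse suggestion above (hedged: this is not the original script, just one way to restructure it): pass the client created at the top of the script into the helper instead of opening a new connection on every call.

```ruby
# Sketch of the suggested refactor: the helper takes the client as an
# argument instead of calling Elasticsearch::Client.new on every
# invocation, so the benchmark reuses one HTTP connection throughout.
def delete_my_indices!(client)
  all_indices = client.indices.get_alias.keys
  my_indices = all_indices.grep(/test_/)
  client.indices.delete(index: my_indices) if my_indices.any?
end
```

The call sites in the benchmark would then become delete_my_indices!(client).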

To investigate, I would first check if there are error or warning messages in the logs, or any message regarding the GC duration.

The Ruby elasticsearch gem is version 7.6.0, but I've seen the performance issue using the 6.x client against version 6.8 as well.

Yes, I run each ES version by itself and reset the node dirs and etc/elasticsearch before switching versions. One of our sysadmins had similar results using Docker:

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.6.2
docker run --rm -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "bootstrap.memory_lock=true" --ulimit "memlock=-1:-1" -e ES_JAVA_OPTS="-Xms4g -Xmx4g -Xlog:disable" docker.elastic.co/elasticsearch/elasticsearch:7.6.2
docker pull docker.elastic.co/elasticsearch/elasticsearch:6.8.8
docker run --rm -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "bootstrap.memory_lock=true" --ulimit "memlock=-1:-1" -e ES_JAVA_OPTS="-Xms4g -Xmx4g -Xlog:disable" docker.elastic.co/elasticsearch/elasticsearch:6.8.8
docker pull docker.elastic.co/elasticsearch/elasticsearch:5.6.16
docker run --rm -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "bootstrap.memory_lock=true" -e "xpack.security.enabled=false" --ulimit "memlock=-1:-1" -e ES_JAVA_OPTS="-Xms4g -Xmx4g" docker.elastic.co/elasticsearch/elasticsearch:5.6.16

Yes

Yes, but it makes no difference, and we've done some other profiling to confirm that the delete is always cheap.

Whoops. I got distracted by how bad the ES 7 performance was. I'll get some numbers from multiple runs to show the slowdown on 6.8 and post below.

We've seen this behavior in our production app's test suite with http_persistent, typhoeus and patron. Would it help to redo this as a bash script with curl?

I'm not doubting your findings; I just wanted to exclude some possible pitfalls.

Thank you for your help digging into this

Thank you for jumping in.

Per a suggestion from one of our sysadmins, I added a variant that uses wait_for_active_shards: 0 and I changed the delete command to use the persisted client. I'm also doing 30s runs now to get more meaningful timings.
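
For reference, a sketch of what that variant might look like (create_indices! is a hypothetical helper name; the actual script inlines this in the report block): the option is forwarded to the create-index call as wait_for_active_shards, so Elasticsearch acknowledges the index without waiting for the primary shard to become active.

```ruby
# Hypothetical helper illustrating the variant: when `wait` is given,
# it is forwarded to the create call as wait_for_active_shards: 0,
# so the request returns before the primary shard is active.
def create_indices!(client, index_configs, wait: nil)
  index_configs.each do |conf|
    opts = { index: conf[:name], body: conf[:body] }
    opts[:wait_for_active_shards] = wait unless wait.nil?
    client.indices.create(**opts)
  end
end
```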

Here are the results for 5.6.16.

Docker command:

docker run --rm -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "bootstrap.memory_lock=true" -e "xpack.security.enabled=false" --ulimit "memlock=-1:-1" -e ES_JAVA_OPTS="-Xms4g -Xmx4g" docker.elastic.co/elasticsearch/elasticsearch:5.6.16

Results of three runs

First:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      1.965  (± 0.0%) i/s -     59.000  in  30.475732s
   20 indices wait:0      2.452  (± 0.0%) i/s -     74.000  in  30.352599s

Comparison:
   20 indices wait:0:        2.5 i/s
   create 20 indices:        2.0 i/s - 1.25x  slower

Second:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      2.205  (± 0.0%) i/s -     66.000  in  30.036526s
   20 indices wait:0      2.393  (± 0.0%) i/s -     72.000  in  30.230190s

Comparison:
   20 indices wait:0:        2.4 i/s
   create 20 indices:        2.2 i/s - 1.09x  slower

Third:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      2.228  (± 0.0%) i/s -     67.000  in  30.178549s
   20 indices wait:0      2.540  (± 0.0%) i/s -     76.000  in  30.111070s

Comparison:
   20 indices wait:0:        2.5 i/s
   create 20 indices:        2.2 i/s - 1.14x  slower

Looking across runs, both variations can create and delete 20 indices about 2-2.5 times a second, and it stays fast for all three runs.

Here's 6.8.8

First run:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      1.018  (± 0.0%) i/s -     31.000  in  30.506013s
   20 indices wait:0      1.489  (± 0.0%) i/s -     45.000  in  30.441973s

Comparison:
   20 indices wait:0:        1.5 i/s
   create 20 indices:        1.0 i/s - 1.46x  slower

Second run:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      0.979  (± 0.0%) i/s -     30.000  in  30.788137s
   20 indices wait:0      1.271  (± 0.0%) i/s -     39.000  in  30.784245s

Comparison:
   20 indices wait:0:        1.3 i/s
   create 20 indices:        1.0 i/s - 1.30x  slower

Third run:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      0.874  (± 0.0%) i/s -     27.000  in  31.014928s
   20 indices wait:0      1.046  (± 0.0%) i/s -     32.000  in  30.689844s

Comparison:
   20 indices wait:0:        1.0 i/s
   create 20 indices:        0.9 i/s - 1.20x  slower

Each run is slightly slower than the last. On our internal test cluster, we see this drag out to where each individual create request eventually takes over 1 second. We've worked around this by restarting our cluster every night, but it still means the first CI run of our test suite on any given day takes around 25 minutes and the last is closer to 40, depending on how many times we push.

Here's 7.6.2

First run:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      0.657  (± 0.0%) i/s -     20.000  in  30.473934s
   20 indices wait:0      0.916  (± 0.0%) i/s -     28.000  in  30.594544s

Comparison:
   20 indices wait:0:        0.9 i/s
   create 20 indices:        0.7 i/s - 1.39x  slower

Second:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      0.685  (± 0.0%) i/s -     21.000  in  30.717306s
   20 indices wait:0      0.888  (± 0.0%) i/s -     27.000  in  30.419167s

Comparison:
   20 indices wait:0:        0.9 i/s
   create 20 indices:        0.7 i/s - 1.30x  slower

Third:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      0.667  (± 0.0%) i/s -     20.000  in  30.033112s
   20 indices wait:0      0.831  (± 0.0%) i/s -     25.000  in  30.114476s

Comparison:
   20 indices wait:0:        0.8 i/s
   create 20 indices:        0.7 i/s - 1.25x  slower

Here the slowdown is less pronounced than on 6.8.8, but I wonder if that's because it creates so many fewer indices over time. When I restart the Docker image and run it five times in a row instead of three, I get:

First:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      0.668  (± 0.0%) i/s -     20.000  in  30.046556s
   20 indices wait:0      0.925  (± 0.0%) i/s -     28.000  in  30.337861s

Comparison:
   20 indices wait:0:        0.9 i/s
   create 20 indices:        0.7 i/s - 1.38x  slower

Second:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      0.709  (± 0.0%) i/s -     22.000  in  31.067696s
   20 indices wait:0      0.909  (± 0.0%) i/s -     28.000  in  30.863562s

Comparison:
   20 indices wait:0:        0.9 i/s
   create 20 indices:        0.7 i/s - 1.28x  slower

Third:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      0.680  (± 0.0%) i/s -     21.000  in  30.910941s
   20 indices wait:0      0.869  (± 0.0%) i/s -     27.000  in  31.095478s

Comparison:
   20 indices wait:0:        0.9 i/s
   create 20 indices:        0.7 i/s - 1.28x  slower

Fourth:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      0.638  (± 0.0%) i/s -     20.000  in  31.436337s
   20 indices wait:0      0.790  (± 0.0%) i/s -     24.000  in  30.394412s

Comparison:
   20 indices wait:0:        0.8 i/s
   create 20 indices:        0.6 i/s - 1.24x  slower

Fifth:

Warming up --------------------------------------
   create 20 indices     1.000  i/100ms
   20 indices wait:0     1.000  i/100ms
Calculating -------------------------------------
   create 20 indices      0.605  (± 0.0%) i/s -     19.000  in  31.411018s
   20 indices wait:0      0.707  (± 0.0%) i/s -     22.000  in  31.182337s

Comparison:
   20 indices wait:0:        0.7 i/s
   create 20 indices:        0.6 i/s - 1.17x  slower

I plopped our example script into a project up on GitHub:

I'm most interested in the effects in 7.6.2 since (1) index creation involves a lot of cluster state updates and 7.x does cluster state updates very differently from earlier versions, and also (2) it's unlikely we'll make any changes to 6.x versions in this area and certainly won't be doing anything in 5.x.

Can you set logger.org.elasticsearch.cluster.service: TRACE on all the nodes in your 7.6.2 cluster, re-run your 5x experiment, and then provide all the logs from all the nodes? They will be quite copious, so you'll need to share them outside of this forum; I suggest attaching them to an issue on https://github.com/APMG/es_troubleshooting_index_creation/issues but other options are available.
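
For anyone following along, the static way to enable that level is a one-line addition to elasticsearch.yml on each node (restart required); the same logger can also be changed at runtime as a transient setting via the _cluster/settings API:

```yaml
# elasticsearch.yml — trace logging for the cluster service,
# as requested above
logger.org.elasticsearch.cluster.service: TRACE
```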


5.x doesn't have an issue. This regression actually didn't show up right away when we moved to 6.x. It was somewhere around 6.6(?) that our test suite started dragging out, so something in there introduced behavior that gradually degrades index creation performance on a running cluster over time.

I'll get on those debug numbers.

Would it be possible to have someone else run my test script above to see if they get similar results?

Sorry, I don't have the Ruby-fu to run it, but I did try a similar process in Bash:

#!/bin/bash

set -euo pipefail

curl -s -XDELETE 'http://localhost:9200/test_*' > /dev/null

while true
do
  start=$(date +%s%N)

  for i in $(seq 10 30)
  do
    curl -s -XPUT 'http://localhost:9200/test_'$i -H 'Content-type: application/json' --data-binary $'{"settings":{"number_of_shards":1,"number_of_replicas":0}}' > /dev/null
  done

  for i in $(seq 10 30)
  do
    curl -s -XDELETE 'http://localhost:9200/test_'$i > /dev/null
  done

  end=$(date +%s%N)

  duration=$((($end-$start)/1000000))
  echo "Duration: $duration ms"

done

It's not showing any appreciable trend:

$ ./test-script.sh
Duration: 6137 ms
Duration: 6323 ms
Duration: 6193 ms
Duration: 6058 ms
Duration: 5986 ms
Duration: 6163 ms
Duration: 6093 ms
Duration: 6178 ms
Duration: 5916 ms
Duration: 6227 ms
Duration: 6224 ms
Duration: 6132 ms
Duration: 5887 ms
Duration: 5823 ms
Duration: 6246 ms
Duration: 6381 ms
Duration: 6109 ms
Duration: 6406 ms
Duration: 6363 ms
Duration: 6377 ms
Duration: 6408 ms
Duration: 5969 ms
Duration: 6173 ms
Duration: 6238 ms
Duration: 6282 ms
Duration: 5569 ms
Duration: 5759 ms
Duration: 6118 ms
Duration: 5719 ms
Duration: 6000 ms
Duration: 6259 ms
Duration: 5811 ms
Duration: 6084 ms
Duration: 6483 ms
Duration: 6425 ms
Duration: 6232 ms
Duration: 5820 ms
Duration: 6329 ms
Duration: 6235 ms
Duration: 6333 ms
Duration: 6405 ms
Duration: 6263 ms
Duration: 5676 ms
Duration: 6004 ms
Duration: 5860 ms
Duration: 5633 ms
Duration: 6297 ms
Duration: 6378 ms
Duration: 6434 ms
Duration: 6551 ms
Duration: 5923 ms
Duration: 6242 ms
Duration: 6274 ms
Duration: 6443 ms
Duration: 6397 ms
Duration: 6320 ms
Duration: 6437 ms
Duration: 6258 ms
Duration: 6661 ms
Duration: 6256 ms
Duration: 6166 ms
Duration: 6041 ms
Duration: 6229 ms

FWIW: I don't think that script will hit the volume the Ruby script does nearly as quickly. I'm trying to run it locally, but OS X doesn't ship GNU date, so its date command doesn't support the %N format argument.

If you have a relatively recent Ruby installed on your system, you should only need to run the following three commands:

gem install bundler
bundle install
bundle exec ruby bench_es_index_mgmt.rb

Here's what I get against 6.8.8 if I let the bash script run for a long while:

[20-05-04 13:13] √ 👉 bash bench_indexing.sh
Duration: 1779 ms
Duration: 1758 ms
Duration: 1768 ms
Duration: 1679 ms
Duration: 1674 ms
Duration: 1688 ms
Duration: 1709 ms
Duration: 1714 ms
Duration: 1688 ms
Duration: 1824 ms
Duration: 1681 ms
Duration: 1800 ms
Duration: 1757 ms
Duration: 1706 ms
Duration: 1722 ms
Duration: 1726 ms
Duration: 1716 ms
Duration: 1722 ms
Duration: 1740 ms
Duration: 1705 ms
Duration: 1746 ms
Duration: 1699 ms
Duration: 1717 ms
Duration: 1698 ms
Duration: 1766 ms
Duration: 1687 ms
Duration: 1684 ms
Duration: 1705 ms
Duration: 1935 ms
Duration: 1927 ms
Duration: 1673 ms
Duration: 1763 ms
Duration: 2267 ms
Duration: 2266 ms
Duration: 2189 ms
Duration: 1898 ms
Duration: 1836 ms
Duration: 1965 ms
Duration: 1820 ms
Duration: 1813 ms
Duration: 1763 ms
Duration: 1709 ms
Duration: 1846 ms
Duration: 1757 ms
Duration: 1781 ms
Duration: 1802 ms
Duration: 1938 ms
Duration: 1742 ms
Duration: 1830 ms
Duration: 1875 ms
Duration: 2002 ms
Duration: 2003 ms
Duration: 1847 ms
Duration: 1993 ms
Duration: 1869 ms
Duration: 2007 ms
Duration: 1829 ms
Duration: 1786 ms
Duration: 1926 ms
Duration: 1877 ms
Duration: 1876 ms
Duration: 2118 ms
Duration: 1913 ms
Duration: 1978 ms
Duration: 1914 ms
Duration: 1890 ms
Duration: 1818 ms
Duration: 1901 ms
Duration: 2038 ms
Duration: 1863 ms
Duration: 1823 ms
Duration: 2024 ms
Duration: 1946 ms
Duration: 2753 ms
Duration: 2087 ms
Duration: 1883 ms
Duration: 2101 ms
Duration: 2152 ms
Duration: 1962 ms
Duration: 2075 ms
Duration: 1921 ms
Duration: 1867 ms
Duration: 2086 ms
Duration: 1995 ms
Duration: 1917 ms
Duration: 1913 ms
Duration: 1963 ms
Duration: 1989 ms
Duration: 2010 ms
Duration: 2073 ms
Duration: 2065 ms
Duration: 2142 ms
Duration: 2072 ms
Duration: 1974 ms
Duration: 1949 ms
Duration: 2011 ms
Duration: 1966 ms
Duration: 2098 ms
Duration: 2258 ms
Duration: 2048 ms
Duration: 2279 ms
Duration: 2160 ms
Duration: 2176 ms
Duration: 2093 ms
Duration: 2195 ms
Duration: 2083 ms
Duration: 1995 ms
Duration: 2092 ms
Duration: 2037 ms
Duration: 2030 ms
Duration: 2102 ms
Duration: 1960 ms
Duration: 1985 ms
Duration: 1993 ms
Duration: 2015 ms
Duration: 2071 ms
Duration: 1989 ms
Duration: 2142 ms
Duration: 1975 ms
Duration: 2068 ms
Duration: 2200 ms
Duration: 2017 ms
Duration: 2134 ms
Duration: 2001 ms
Duration: 2074 ms
Duration: 2026 ms
Duration: 2016 ms
Duration: 2040 ms
Duration: 2030 ms
Duration: 2057 ms
Duration: 2086 ms
Duration: 2178 ms
Duration: 2128 ms
Duration: 2042 ms
Duration: 2073 ms
Duration: 2077 ms
Duration: 2166 ms
Duration: 2103 ms
Duration: 2204 ms
Duration: 2253 ms
Duration: 2125 ms
Duration: 2152 ms
Duration: 2090 ms
Duration: 2199 ms
Duration: 2127 ms
Duration: 2077 ms
Duration: 2233 ms
Duration: 2150 ms
Duration: 2266 ms
Duration: 2178 ms
Duration: 2158 ms
Duration: 2102 ms
Duration: 2334 ms
Duration: 2254 ms
Duration: 2153 ms
Duration: 2249 ms
Duration: 2131 ms
Duration: 2137 ms
Duration: 2202 ms
Duration: 2492 ms
Duration: 2159 ms
Duration: 2186 ms
Duration: 2147 ms
Duration: 2232 ms
Duration: 2208 ms
Duration: 2281 ms
Duration: 2170 ms
Duration: 2158 ms
Duration: 2205 ms
Duration: 2233 ms
Duration: 2220 ms
Duration: 2202 ms
Duration: 2274 ms
Duration: 2191 ms
Duration: 2283 ms
Duration: 2382 ms
Duration: 2293 ms
Duration: 2231 ms
Duration: 2323 ms
Duration: 2244 ms
Duration: 2256 ms
Duration: 2310 ms
Duration: 2297 ms
Duration: 2288 ms
Duration: 2257 ms
Duration: 2426 ms
Duration: 2242 ms
Duration: 2347 ms
Duration: 2280 ms
Duration: 2294 ms
Duration: 2332 ms
Duration: 2362 ms
Duration: 2327 ms
Duration: 2345 ms
Duration: 2370 ms
Duration: 2341 ms
Duration: 2326 ms
Duration: 2387 ms
Duration: 2346 ms
Duration: 2370 ms
Duration: 2333 ms
Duration: 2480 ms
Duration: 2426 ms
Duration: 2414 ms
Duration: 2446 ms
Duration: 2466 ms
Duration: 2385 ms
Duration: 2430 ms
Duration: 2377 ms
Duration: 2502 ms
Duration: 2459 ms
Duration: 2540 ms
Duration: 2420 ms
Duration: 2485 ms
Duration: 2445 ms
Duration: 2492 ms
Duration: 2441 ms
Duration: 2476 ms
Duration: 2446 ms
Duration: 2461 ms
Duration: 2475 ms
Duration: 2483 ms
Duration: 2521 ms
Duration: 2484 ms
Duration: 2418 ms
Duration: 2564 ms
Duration: 2436 ms
Duration: 2515 ms
Duration: 2607 ms
Duration: 2627 ms
Duration: 2509 ms
Duration: 2564 ms
Duration: 2512 ms
Duration: 2561 ms
Duration: 2636 ms
Duration: 2609 ms
Duration: 2487 ms
Duration: 2557 ms
Duration: 2526 ms
Duration: 2556 ms
Duration: 2548 ms
Duration: 2619 ms
Duration: 2504 ms
Duration: 2588 ms
Duration: 2556 ms
Duration: 2611 ms
Duration: 2580 ms
Duration: 2577 ms
Duration: 2575 ms
Duration: 2650 ms
Duration: 2617 ms
Duration: 2565 ms
Duration: 2740 ms
Duration: 2598 ms
Duration: 2592 ms
Duration: 2645 ms
Duration: 2588 ms
Duration: 2624 ms
Duration: 2824 ms
Duration: 2702 ms
Duration: 2642 ms
Duration: 2622 ms
Duration: 2628 ms
Duration: 2626 ms
Duration: 2595 ms
Duration: 2646 ms
Duration: 2660 ms
Duration: 2668 ms
Duration: 2635 ms
Duration: 2713 ms
Duration: 2633 ms
Duration: 2631 ms
Duration: 2800 ms
Duration: 2677 ms
Duration: 2678 ms
Duration: 2667 ms
Duration: 2708 ms
Duration: 2627 ms
Duration: 2678 ms
Duration: 2677 ms
Duration: 2778 ms
Duration: 2700 ms
Duration: 2689 ms
Duration: 2661 ms
Duration: 2687 ms
Duration: 2789 ms
Duration: 2716 ms
Duration: 2687 ms
Duration: 2765 ms
Duration: 2750 ms
Duration: 2712 ms
Duration: 2767 ms
Duration: 2651 ms
Duration: 2801 ms
Duration: 2884 ms
Duration: 2733 ms
Duration: 2872 ms
Duration: 2724 ms
Duration: 2771 ms
Duration: 2740 ms
Duration: 2834 ms
Duration: 2728 ms
Duration: 3039 ms
Duration: 2821 ms
Duration: 2706 ms
Duration: 2759 ms
Duration: 2784 ms
Duration: 2802 ms
Duration: 2758 ms
Duration: 2726 ms
Duration: 2992 ms
Duration: 2934 ms
Duration: 2877 ms
Duration: 2855 ms
Duration: 2867 ms
Duration: 2972 ms
Duration: 2796 ms
Duration: 2855 ms
Duration: 2866 ms
Duration: 2806 ms
Duration: 2825 ms
Duration: 2836 ms
Duration: 2929 ms
Duration: 3075 ms
Duration: 2902 ms
Duration: 2999 ms
Duration: 2817 ms
Duration: 2840 ms
Duration: 2881 ms
Duration: 3010 ms
Duration: 2844 ms
Duration: 2956 ms
Duration: 3007 ms
Duration: 2984 ms
Duration: 2997 ms
Duration: 2989 ms
Duration: 3070 ms
Duration: 2959 ms
Duration: 3089 ms
Duration: 2959 ms
Duration: 3214 ms
Duration: 2992 ms
Duration: 3407 ms
Duration: 3402 ms
Duration: 3200 ms

I did slightly modify the original script to more closely match the Ruby script's behavior:

#!/bin/bash

set -euo pipefail

curl -s -XDELETE 'http://localhost:9200/test_*' > /dev/null

while true
do
  start=$(gdate +%s%N)

  for i in $(seq 10 30)
  do
    curl -s -XPUT 'http://localhost:9200/test_'$i -H 'Content-type: application/json' --data-binary $'{"settings":{"number_of_shards":1,"number_of_replicas":0}}' > /dev/null
  done

  curl -s -XDELETE 'http://localhost:9200/test_*' > /dev/null

  end=$(gdate +%s%N)

  duration=$((($end-$start)/1000000))
  echo "Duration: $duration ms"
done

Curious. OK, I left the script running since my previous reply, and it's now consistently taking 10 seconds per iteration, whereas it started out at 6.

Could you grab the output of GET _nodes/hot_threads?threads=9999 a few times while you are running your tests and they're going slower than you'd like? I can see some interesting things in my hot threads and would like to check that you're seeing the same thing.

$ curl localhost:9200/_nodes/hot_threads?threads=9999
::: {63158d0a1bb6}{9tPJl94_S0ehqds7H2sT2Q}{D8YTz1TkReWziqTeohvfrw}{172.17.0.2}{172.17.0.2:9300}{dilm}{ml.machine_memory=8086286336, xpack.installed=true, ml.max_open_jobs=20}
   Hot threads at 2020-05-05T00:55:36.264Z, interval=500ms, busiestThreads=9999, ignoreIdleThreads=true:
   17.2% (86.1ms out of 500ms) cpu usage by thread 'elasticsearch[63158d0a1bb6][masterService#updateTask][T#1]'
     3/10 snapshots sharing following 20 elements
       java.base@13.0.2/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@13.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
       java.base@13.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
       java.base@13.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
       java.base@13.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
       app//org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252)
       app//org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
       app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
       app//org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:272)
       app//org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250)
       app//org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73)
       app//org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
       app//org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
       app//org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633)
       app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
       app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@13.0.2/java.lang.Thread.run(Thread.java:830)
    2.1% (10.4ms out of 500ms) cpu usage by thread 'elasticsearch[63158d0a1bb6][generic][T#2]'
     3/10 snapshots sharing following 23 elements
       app//org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
       app//org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:3131)
       app//org.elasticsearch.gateway.PersistedClusterStateService$MetaDataIndexWriter.prepareCommit(PersistedClusterStateService.java:526)
       app//org.elasticsearch.gateway.PersistedClusterStateService$Writer.commit(PersistedClusterStateService.java:737)
       app//org.elasticsearch.gateway.PersistedClusterStateService$Writer.writeIncrementalStateAndCommit(PersistedClusterStateService.java:616)
       app//org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.setLastAcceptedState(GatewayMetaState.java:536)
       app//org.elasticsearch.cluster.coordination.CoordinationState.handlePublishRequest(CoordinationState.java:357)
       app//org.elasticsearch.cluster.coordination.Coordinator.handlePublishRequest(Coordinator.java:332)
       app//org.elasticsearch.cluster.coordination.Coordinator$$Lambda$2862/0x000000080148e440.apply(Unknown Source)
       app//org.elasticsearch.cluster.coordination.PublicationTransportHandler.acceptState(PublicationTransportHandler.java:447)
       app//org.elasticsearch.cluster.coordination.PublicationTransportHandler.handleIncomingPublishRequest(PublicationTransportHandler.java:430)
       app//org.elasticsearch.cluster.coordination.PublicationTransportHandler.lambda$new$0(PublicationTransportHandler.java:100)
       app//org.elasticsearch.cluster.coordination.PublicationTransportHandler$$Lambda$2865/0x000000080148d040.messageReceived(Unknown Source)
       org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315)
       app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
       app//org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:762)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@13.0.2/java.lang.Thread.run(Thread.java:830)
    0.4% (2.2ms out of 500ms) cpu usage by thread 'elasticsearch[63158d0a1bb6][generic][T#1]'
     7/10 snapshots sharing following 12 elements
       app//org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:96)
       app//org.elasticsearch.index.shard.StoreRecovery$$Lambda$4343/0x0000000801813040.get(Unknown Source)
       app//org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:285)
       app//org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:94)
       app//org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1866)
       app//org.elasticsearch.index.shard.IndexShard$$Lambda$4338/0x0000000801814c40.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@13.0.2/java.lang.Thread.run(Thread.java:830)
     3/10 snapshots sharing following 9 elements
       java.base@13.0.2/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@13.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
       java.base@13.0.2/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:743)
       java.base@13.0.2/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:684)
       java.base@13.0.2/java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1366)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@13.0.2/java.lang.Thread.run(Thread.java:830)
    0.0% (43.2micros out of 500ms) cpu usage by thread 'ticker-schedule-trigger-engine'
     10/10 snapshots sharing following 2 elements
       java.base@13.0.2/java.lang.Thread.sleep(Native Method)
       org.elasticsearch.xpack.watcher.trigger.schedule.engine.TickerScheduleTriggerEngine$Ticker.run(TickerScheduleTriggerEngine.java:193)
    0.0% (0s out of 500ms) cpu usage by thread 'Common-Cleaner'
     10/10 snapshots sharing following 5 elements
       java.base@13.0.2/java.lang.Object.wait(Native Method)
       java.base@13.0.2/java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:155)
       java.base@13.0.2/jdk.internal.ref.CleanerImpl.run(CleanerImpl.java:148)
       java.base@13.0.2/java.lang.Thread.run(Thread.java:830)
       java.base@13.0.2/jdk.internal.misc.InnocuousThread.run(InnocuousThread.java:134)
    0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[keepAlive/7.6.2]'
     10/10 snapshots sharing following 8 elements
       java.base@13.0.2/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@13.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
       java.base@13.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
       java.base@13.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
       java.base@13.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
       java.base@13.0.2/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
       app//org.elasticsearch.bootstrap.Bootstrap$1.run(Bootstrap.java:83)
       java.base@13.0.2/java.lang.Thread.run(Thread.java:830)
    0.0% (0s out of 500ms) cpu usage by thread 'ml-cpp-log-tail-thread'
     10/10 snapshots sharing following 6 elements
       java.base@13.0.2/java.io.FileInputStream.readBytes(Native Method)
       java.base@13.0.2/java.io.FileInputStream.read(FileInputStream.java:251)
       org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler.tailStream(CppLogMessageHandler.java:105)
       org.elasticsearch.xpack.ml.process.NativeController.lambda$tailLogsInThread$0(NativeController.java:72)
       org.elasticsearch.xpack.ml.process.NativeController$$Lambda$2270/0x00000008012dc840.run(Unknown Source)
       java.base@13.0.2/java.lang.Thread.run(Thread.java:830)
    0.0% (0s out of 500ms) cpu usage by thread 'DestroyJavaVM'
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
    0.0% (0s out of 500ms) cpu usage by thread 'process reaper'
     10/10 snapshots sharing following 5 elements
       java.base@13.0.2/java.lang.ProcessHandleImpl.waitForProcessExit0(Native Method)
       java.base@13.0.2/java.lang.ProcessHandleImpl$1.run(ProcessHandleImpl.java:138)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@13.0.2/java.lang.Thread.run(Thread.java:830)
    0.0% (0s out of 500ms) cpu usage by thread 'Connection evictor'
     10/10 snapshots sharing following 3 elements
       java.base@13.0.2/java.lang.Thread.sleep(Native Method)
       org.apache.http.impl.client.IdleConnectionEvictor$1.run(IdleConnectionEvictor.java:66)
       java.base@13.0.2/java.lang.Thread.run(Thread.java:830)

Hmm, that all looks unsurprising to me. It may take a few tries before you catch the problem, especially if it's only a few percent slower than it was at the start.
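If it helps, here is a rough sketch of one way to capture a series of dumps while the benchmark runs, so a slow moment has a chance of showing up in at least one of them. The URL, dump count, and interval are just placeholders to adjust for your setup:

```shell
# Grab several hot_threads dumps a few seconds apart while the test
# suite is running; diff/inspect the files afterwards to spot the slow
# iterations. ES_URL, the count, and the sleep are placeholders.
ES_URL="${ES_URL:-http://localhost:9200}"
for i in $(seq 1 5); do
  curl -s "$ES_URL/_nodes/hot_threads?threads=9999" > "hot_threads_$i.txt"
  sleep 1
done
```

Increase the count and interval to cover the whole run of your suite; with `threads=9999` each dump includes every thread, so even a briefly-busy master thread should appear.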

For reference, the issue I found after running the script for several hours is https://github.com/elastic/elasticsearch/issues/56171, and there's a stack dump showing the problem on that issue.

One of our sys admins wanted me to pass this along:

$ curl localhost:9200/_nodes/hot_threads?threads=9999
::: {63158d0a1bb6}{9tPJl94_S0ehqds7H2sT2Q}{D8YTz1TkReWziqTeohvfrw}{172.17.0.2}{172.17.0.2:9300}{dilm}{ml.machine_memory=8086286336, xpack.installed=true, ml.max_open_jobs=20}
   Hot threads at 2020-05-05T15:17:45.765Z, interval=500ms, busiestThreads=9999, ignoreIdleThreads=true:
   81.2% (405.9ms out of 500ms) cpu usage by thread 'elasticsearch[63158d0a1bb6][masterService#updateTask][T#1]'
     10/10 snapshots sharing following 20 elements
       java.base@13.0.2/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@13.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
       java.base@13.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
       java.base@13.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
       java.base@13.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
       app//org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252)
       app//org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
       app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
       app//org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:272)
       app//org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250)
       app//org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73)
       app//org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
       app//org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
       app//org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633)
       app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
       app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@13.0.2/java.lang.Thread.run(Thread.java:830)
    1.1% (5.4ms out of 500ms) cpu usage by thread 'elasticsearch[63158d0a1bb6][generic][T#4]'
     2/10 snapshots sharing following 19 elements
       app//org.elasticsearch.gateway.PersistedClusterStateService$Writer.writeIncrementalStateAndCommit(PersistedClusterStateService.java:616)
       app//org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.setLastAcceptedState(GatewayMetaState.java:536)
       app//org.elasticsearch.cluster.coordination.CoordinationState.handlePublishRequest(CoordinationState.java:357)
       app//org.elasticsearch.cluster.coordination.Coordinator.handlePublishRequest(Coordinator.java:332)
       app//org.elasticsearch.cluster.coordination.Coordinator$$Lambda$2862/0x000000080148e440.apply(Unknown Source)
       app//org.elasticsearch.cluster.coordination.PublicationTransportHandler.acceptState(PublicationTransportHandler.java:447)
       app//org.elasticsearch.cluster.coordination.PublicationTransportHandler.handleIncomingPublishRequest(PublicationTransportHandler.java:430)
       app//org.elasticsearch.cluster.coordination.PublicationTransportHandler.lambda$new$0(PublicationTransportHandler.java:100)
       app//org.elasticsearch.cluster.coordination.PublicationTransportHandler$$Lambda$2865/0x000000080148d040.messageReceived(Unknown Source)
       org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315)
       app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
       app//org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:762)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@13.0.2/java.lang.Thread.run(Thread.java:830)
     8/10 snapshots sharing following 9 elements
       java.base@13.0.2/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@13.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
       java.base@13.0.2/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:743)
       java.base@13.0.2/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:684)
       java.base@13.0.2/java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1366)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
       java.base@13.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@13.0.2/java.lang.Thread.run(Thread.java:830)

Did they say why? That looks pretty much the same as the previous dump you shared, and I don't see anything surprising in it.