Indexing performance degrading over time

Hi,

we are doing performance testing of various types of environments to decide on a properly sized set of nodes for a number of Elasticsearch Cluster installations.

When doing long-term indexing testing, we see a sharp drop of indexing performance in the beginning, stabilizing after some time. I.e. writing performance is up to 5 times higher initially compared to after 6 hours.

Typically after something like 6-8 hours performance becomes more or less stable and does not decrease a lot any more.

See the following screenshot, the blue bars are the number of documents written per time-frame, the initial performance is quite good, but quickly drops.

We mostly use Elasticsearch 2.1.1, and we see this pretty much across the board, from 1 node machines with 16GB RAM and 4 cores up to 6-node clusters with 32GB and 8 cores.

We did apply common performance optimizations mentioned in the documentation and we are using bulk indexing with 5k/5MB bulks usually.

When doing performance analysis, we see the system is mostly CPU bound, probably because the documents are large (1.5k bytes JSON aprox.) and include a few nested documents.

See this screenshot from dynaTrace: It shows the CPU usage on the left side and some Elasticsearch-specific metrics in the middle, there does not seem to be any throttling kicking in, networking is not an issue as this test even ran indexing from the same machine, read-operations are constant, writing operations go down with the number of documents that are written.

Is this something that is expected because of segment merging kicking in only after some time? However 6 hours seems a bit long when writing a few thousand documents per second.

Or do we run into some configuration-limit on indexing which throttles down the rate of indexing? Or Thread pools?

Thanks... Dominik.

3 Likes

Hmm. The CPU usage is almost certainly segment merging...it's effectively a streaming sort of the segments involved in the merge (plus various compression/decompression operations). So it slurps down a lot of CPU.

It's somewhat expected for indexing throughput to decline as the shard size increases, but you have a very dramatic decrease. It looks unusual to me (especially considering the throughput earlier in the run, and that you aren't bottlenecking on disk IO).

Some random questions/observations, in no particular order:

  • Have you made any tweaks to the config (in particular, anything to merge policy, translog, etc)? Would you mind pasting them?

  • I can't quite tell, but it looks like the translog size drops significantly right as the indexing throughput decreases. Was there a configuration change? How do you have the translog configured? Are you using the default request fsync behavior?

  • I noticed the disk says "EBS Magnetic". How many merge threads are configured (default, or something custom)?

  • Are you making a lot of updates to nested documents? Or are these "append-only", where you are only adding new documents?

  • Do you know if there are some "outlier" documents, that might have very large number of nested children?

  • Do you see anything in your logs about throttling "in-flight" merges?

Thanks for the notes,

  • We only use a small number of settings, the bulk-threadpool change was necessary because at some point we wrote to many different indexes, but in the latest tests only one index was written to without difference.

          # Use higher sizes for thread-pools to allow to run larger bulk-indexing
          threadpool.bulk.queue_size: 1000
          threadpool.bulk.size: 10
          
          # Try to save memory with many indexes/shards
          # see https://github.com/elastic/elasticsearch/issues/8394#issuecomment-65871792 for some discussion
          index.load_fixed_bitset_filters_eagerly: false
          
          # Use a higher wait-time that the master waits until it re-balances shards/replicas
          # See https://www.elastic.co/guide/en/elasticsearch/reference/current/delayed-allocation.html
          index.unassigned.node_left.delayed_timeout: "15m"
    

Apart from that everything is using default values.

  • Index settings are

          "number_of_replicas": "1",
          "number_of_shards": "6",
          "refresh_interval": "30s"
    
  • No change to translog settings as far as I see

  • It should be primarily newly written documents, only very few duplicates, no real updates. The number of nested documents is fairly constant between 1 and 3 items per main document as this is generated synthetic data

  • Logs are completely empty

Any suggestions of additional things we can try?

Are you specifying document IDs at indexing time or letting Elasticsearch generate them automatically?

Hi, we have quite similar problem. ES 2.1.0 indexing slows down after 2.5-3 days (daily indices, append only, self-generated doc FLAKE IDs). Sorry, I cannot provide detailed information right now (still collecting logs and metrics to share). No CPU or IO pressure, in logs mostly merges (normal and throttled). But bulk execution time increases (indexing buffers goes down and keeps low) over time. The only way to get speed back is full cluster restart.

Do you have any anomaly with thread pool and load average on multiple nodes?
watch -d -n 5 "curl -s localhost:9200/_cat/thread_pool | sort"

Can you try to restart the cluster and see if indexing speed increases after it's become green again?

We have generated IDs from outside of Elasticsearch, probably UUIDs, plus routing on another id which we use to store related documents in the same shard.

We will do some runs with auto-generated IDs to see if it is related here.

If you are generating IDs at the application layer, the structure of your ID can impact performance as it may make it more expensive for elasticsearch to check if the ID already exists and need to be updated, especially if you have slow disks. If this is the case, the indexing throughput tend to drop as the shards grow in size. This blog post describes how this works under the hood.

1 Like

Yep, IDs can play a role in it. So that could be part of the slowdown, but I don't know if it would account for the entire effect.

A potential contributor might be this bug: https://github.com/elastic/elasticsearch/pull/13918

Is your data being ingested continuously, or could it be "bursty" enough for a shard to become "inactive" (default 30s edit: default 5min for a shard to become inactive after the last document index/update/delete)? If the shard falls inactive it might be affected by that bug.

Otherwise, not entirely sure, talking to some other devs. Thanks to @Dominik_Stadler and @rusty for the details/stats/info, it makes it a lot easier to kick around ideas internally :slightly_smiling:

You might try setting index.merge.scheduler.max_thread_count: 1 in your config (or update through settings api), since you are on magnetic spinning disks. Multiple merge threads are good for SSD, but tends to slow down spinning.

ES should autodetect and default to 1 for spinning (you'll see in your startup log a message about spinning true/false). I don't think this is the ultimate problem, but it might be worth a shot.

Could you also post the output of your Hot Threads API? Perhaps something interesting will show up there (although I'm guessing it will just show merge threads).

Will keep digging :confused:

I'm thinking this is somehow related to #13198 too, since you mentioned the indexing buffer goes low and stays that way. @Dominik_Stadler, do you notice your indexing buffer staying low as well?

It was also suggested to me by another dev that you might be running into EBS IOPs throttling? Do you know if you are saturating your IOPS?

Not an EC2 expert however, so that's about the extent of my knowledge there :slightly_smiling:

In addition to @rusty's post, running command watch ... shows that only one instance of Elasticsearch has large queue of bulks, whereas remaning tens instances of Elasticsearch does nothing. We use two masters to send them bulks, and we have tried to use client.transport.sniff: true to send bulks in round-robin maner. But it helped very little. Still we have one oveloaded node and decreasing indexing performance. On every cluster restart overloaded node is changing.

We tested that id generator gives unique ids on our docs. So I don't think the problem is with id...

@ivank do you use custom routing? It's not uncommon for custom routing to distribute data in a lumpy manner because your partitioning key is not evenly distributed, or potentially your custom routing is just incorrect and everything is routed to the same shard.

Can happen with parent/child relationships too, since all the children are routed to a single shard with the parent.

@polyfractal no, we don't use custom routing, just indexing docs via bulks, only ids are generated on client side, and they are unique it seems...

As I understand childs are nested object (in JSON terms)? Documents are mainly plain and we don't use any explicit relationships.

@ivank How many shards are in this index? Is it a single primary shard?

Is this related to the indexing slowdown-over-time that @rusty mentioned, or a different symptom/problem? (Are you two working together or is this a third cluster with slowdown-over-time problem?)

I just don't want to derail this thread before we figure out that issue. If it's separate, let's open a new thread and discuss there.

@polyfractal, we are working together with @rusty :slight_smile:

We have different kind of indices, some of them have couple of shards, some are big and so sized to count of nodes in cluster. All they have 1 replica

I did a few more tests, for me it does not happen if I have no nested documents instead of 1 to 3 that I had in my original set of tests. Now the indexing rate (the blue bars in the top row) drop much less and stay constant, the gap seems to be some network issue at that time.

Could it be that nested documents require more and more effort over time?

Anything to test or investigate? I can also add code-level instrumentation via Dynatrace to see where more time is spent, which code-places in Elasticsearch would be interesting here?

That's odd that you only see the slow down when indexing with nested documents.

Nested documents do require more work to index and merge, since under-the-hood, each of your 1 to 3 nested docs (plus the parent doc) are indexed as separate documents to Lucene, but that work should not be increasing over time.

Can you pull a full hot threads (pass e.g. threads=10000 to https://www.elastic.co/guide/en/elasticsearch/reference/2.1/cluster-nodes-hot-threads.html) from all nodes when things have gotten very slow with the nested documents?

So, we have collected some more info, and I want to post it here.

At the moment, we have decreasing performance by ~10 hours instead of ~3 days (because of more "right" cluster restart which doen't make elasticsearch to rebalance indices).

Here is manually combined prictures of charts. They show stats of non-overloaded node (other have similar stats).


One difference with overloaded node is in CPU utilization, of cource, because of large bulk queue.

Timeline is saved (approx.)

We have also built animation of how bulks are distributed over nodes using given watch ... command (sort is used, so nodes is not always at the same positions):
~70M docs/5min:

~40M docs/5min:

~10M docs/5min:

hot threads for 40M: http://pastebin.com/9RHdJic6
hot_threads for 10M: http://pastebin.com/Hd4ZdQF1 (smaller, of course)
stats for 10M: http://pastebin.com/WNnB5iDN
thread_pool for 10M: http://pastebin.com/k3mEtRb0 (page is heavy...)

What can we share anything else?

Thank you for posting the hot threads ... I see most hot threads doing normal bulk indexing, which is good, however many are doing ID lookups to locate the previously indexed document by that ID. This usually means too many segments (though your aggregate stats seem OK: ~64 segments per shard) or that the OS doesn't have enough free RAM for IO caching (do you leave 50% of the machine's RAM for it?).

Can you try the suggestions at https://www.elastic.co/blog/performance-indexing-2-0?

In particular, in your "hot threads for 40M" output, I can see merges are being throttled, which can cause too many segments, which will slow down ongoing indexing, and can cause index throttling.

When @rusty said " (indexing buffers goes down and keeps low)", how/where did you see that?

Are you leaving 50% of each data node's RAM to the OS for IO caching?

Are you using spinning disks or SSDs? What ES settings have you changed from defaults?

Not so 64...
Indices differ, of course. There is some indices and segment count:

index1-2016.02.02: 5602
index1-2016.02.03: 7796
index1-2016.02.04: 2229
index2-2016.02.02: 1791
index2-2016.02.03: 1918
index2-2016.02.04: 1928
index3-2016.02.02: 15
index3-2016.02.03: 11
index3-2016.02.04: 18
index4-2016.02.02: 41
index4-2016.02.03: 42
index4-2016.02.04: 49
index5-2016.02.02: 15
index5-2016.02.03: 10
index5-2016.02.04: 16
...

About RAM: 61GB of 128GB is allocated for 2 ES instances, 4 GB is for our custom app that loads data (really can use up to 30GB) and shares CPU and OS cache:

free -m
             total       used       free     shared    buffers     cached
Mem:        129010     127494       1515          1        456      25638
-/+ buffers/cache:     101399      27610
Swap:        16383        862      15521

From the provided link:

  1. We turned off throttling for store - is it good?..
  2. Yes, we use multiple path.data
  3. We use own ID because of the same transient log in our custom app (got same
    code for ID generation as ES uses)
  4. We don't use doc_values because of spinnig disks and we don’t need aggregation or sorting on ES side.

So, we turned off throttling for storing docs, is this affects merging?
Here we use settings in templates for indices:

        /* for the largest indices it’s set to 24 (by doc count <=2B and
        shard size <=50GB, balanced to the ES nodes number) */
        "index.number_of_shards": ..., 
        "index.index_concurrency": 32,
        "index.number_of_replicas": 1,
        "index.refresh_interval": "30s",
        "index.codec.bloom.load": false,
        "index.merge.policy.use_compound_file": true,
        "index.merge.policy.floor_segment": "50mb",
        "index.merge.policy.max_merged_segment": "500mb",
        "index.merge.policy.reclaim_deletes_weight": 0.0,
        "index.merge.async_interval": "10s",
        "index.merge.scheduler.type": "concurrent",
        "index.merge.scheduler.max_thread_count": 2,
        "index.translog.flush_threshold_size": "500mb",
        "index.ttl.disable_purge": true,
        "index.codec": "best_compression",
        "index.store.type": "niofs",
        "index.store.throttle.type": "none",
        "index.warmer.enabled": false,
        "index.mapper.dynamic": false

the largest indices have 22 & 11 shards, others are small and have 1-2 shards.

And settings for Elasticsearch itself:

...
names, paths here... nothing interesting
...
bootstrap.mlockall: true
network.host: 0.0.0.0
http.port: 9200-9400 ports range as usual
transport.tcp.port: 9300-9400 ports range
gateway.recover_after_nodes: 1
discovery.zen.ping.unicast.hosts: ["master_01:9300", "master_02:9300"]
node.master: true for 2 masters
node.data: true for 20 datas
http.cors.enabled: true
http.cors.allow-origin: "*"
node.host: some host
cluster.routing.allocation.awareness.attributes: host
cluster.routing.allocation.same_shard.host:  true
os.processors: 8
threadpool.bulk.size: 30
threadpool.bulk.queue_size: 3000
threadpool.search.size: 30
threadpool.search.queue_size: -1

By indexing buffers he meant indices.segments.index_writer_memory_in_bytes taken from _nodes/stats (third chart on Docs vs CPU.png). But I believe that this decreasing is the consequences.

Hmm some of these settings are not good. Can you revert to defaults and re-test?

E.g.

"index.merge.policy.max_merged_segment": "500mb"

is dangerous (it's 10X smaller than the default) since it can cause too many segments for large shards.

"index.store.type": "niofs"

is very dangerous since this Directory impl. is poor for random access IO needed for the ID lookups.

threadpool.bulk.size: 30
"index.index_concurrency": 32

You should not set these two beyond the number of true cores your hardware has: doing so just hurts indexing throughput and stresses IO system (newer versions of ES now enforce this: https://github.com/elastic/elasticsearch/pull/15585). If anything, set it a bit lower than your true core count ...

"index.codec": "best_compression"

This is maybe the wrong tradeoff, since it means more CPU at indexing (and merging) time for smaller disk usage.

"index.merge.policy.reclaim_deletes_weight": 0.0

That's bad to do if you are doing deletes and you care about search performance, since it means the index can accumulate too high %tg deleted-but-not-reclaimed docs.

Some of your other settings have been removed from ES recently.

It's best to set shard count for an index only as large as is really needed to 1) ensure you never hit the 2.1B per-shard doc limit, and 2) get enough search time concurrency for your search QPS needs. Setting it higher just adds risk of indexing slowdown when some nodes in your cluster are overloaded...

Finally, your output from free seems to show that only ~27 GB is available as buffer cache? Somehow your processes are using up the other 101 GB of RAM? You should try to leave ~50% or more for the OS to cache the hot blocks from the index.