Indexing performance degrading over time

Michael, thank you for the answer. We'll definitely try the defaults and share the results.

Some words about our case: we are building log storage with search capabilities across almost all key metrics. Different kinds of logs have different retention policies (from 1 to 14 months), so we decided to use daily indices and remove old ones with the curator app. All indices are append-only. Since we don't need sorting or aggregations, we disable doc_values and enable best_compression, which is a big cost saver (a 30-50% reduction in storage requirements compared to ES 1.7). We kept the settings from 1.7.x since they were efficient enough and expected trouble only with terms_in_memory, but to our surprise we got a slowdown in indexing speed.

It's best to set shard count for an index only as large as is really needed to

  1. ensure you never hit the 2.1B per-shard doc limit, and
  2. get enough search time concurrency for your search QPS needs. Setting it higher just adds risk of indexing slowdown when some nodes in your cluster are overloaded...
  1. Yes, we use this limit in the shard count calculation (<= 50 GB per shard, fewer than 2B documents per index, and the number of shards should be less than or equal to the node count, so we don't get a hot spot where one node holds more shards than the others) - see the rough sketch below.
  2. OK, that's the next step.
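
For what it's worth, the calculation in point 1 looks roughly like the sketch below (the daily volume and node count are made-up placeholders; the 50 GB and ~2.1B figures are the limits discussed above):

    // Rough shard-count sketch for one daily index; input numbers are hypothetical.
    public class ShardCountSketch {
        public static void main(String[] args) {
            long docsPerDay = 1_500_000_000L;                   // placeholder: docs per daily index
            long bytesPerDay = 600L * 1024 * 1024 * 1024;       // placeholder: on-disk size per day
            long maxDocsPerShard = 2_000_000_000L;              // stay safely below Lucene's ~2.1B limit
            long maxBytesPerShard = 50L * 1024 * 1024 * 1024;   // <= 50 GB per shard
            int dataNodes = 10;                                 // placeholder: data node count

            long byDocs = (docsPerDay + maxDocsPerShard - 1) / maxDocsPerShard;
            long bySize = (bytesPerDay + maxBytesPerShard - 1) / maxBytesPerShard;
            long shards = Math.max(byDocs, bySize);
            // Cap at the node count to avoid a hot spot (per point 1 above); we then check
            // that the per-shard size and doc limits still hold after this cap.
            shards = Math.min(shards, dataNodes);
            System.out.println("primary shards for the daily index: " + shards);
        }
    }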

Finally, your output from free seems to show that only ~27 GB is available as buffer cache? Somehow your processes are using up the other 101 GB of RAM? You should try to leave ~50% or more for the OS to cache the hot blocks from the index.

We have 2 instances of Elasticsearch on one bare-metal server: 61 GB is for Elasticsearch (30.5 + 30.5), and 4 GB is reserved for our app that loads the data (it can use up to 30 GB). We'll try to minimize our app's memory to free up more for Elasticsearch.

But still, why is one node overloaded? Is it normal that one node has a large bulk queue (according to the GIFs)?

We tested the ID generator and it distributes documents across shards evenly. Bulks are sent to the master nodes, which have to split them into per-shard bulks, so every data node should receive roughly equal shard-level bulks and consequently a uniform load. Am I right?
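
For reference, document routing is hash(routing) % number_of_primary_shards, where the routing value defaults to the _id (Elasticsearch uses a Murmur3 hash for this), so a well-distributed ID generator should spread per-shard bulks evenly. A toy way to eyeball that spread (stand-in hash and hypothetical shard count, not ES's actual routing code):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;

    // Toy check of how generated IDs would spread across primary shards.
    // NOTE: String.hashCode() is only a stand-in for ES's Murmur3 routing hash.
    public class RoutingSpreadCheck {
        public static void main(String[] args) {
            int primaryShards = 8;                              // hypothetical shard count
            Map<Integer, Integer> perShard = new HashMap<>();
            for (int i = 0; i < 1_000_000; i++) {
                String id = UUID.randomUUID().toString();       // stand-in for the ID generator
                int shard = Math.floorMod(id.hashCode(), primaryShards);
                perShard.merge(shard, 1, Integer::sum);
            }
            perShard.forEach((shard, count) ->
                    System.out.println("shard " + shard + ": " + count + " docs"));
        }
    }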

"index.store.type": "niofs"

We think that mmapfs, which is part of default_fs, will use too much heap over time.

That is unfortunately a persistent myth: MMapDirectory does not increase heap usage, despite how horrifying the output from "top" or "ps" might seem.

Have a look at Uwe's blog post to understand why: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
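
The short version: a memory-mapped file lives in virtual address space and is paged in and out by the OS; it is not allocated on the Java heap. If you want to convince yourself, a minimal JDK-only demo along these lines shows the mapped region is off-heap (the file path is a placeholder; pass any large existing file):

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Maps a file read-only and shows that the Java heap barely moves: the mapping is
    // "direct" (off-heap) and only inflates virtual/resident size in top, memory the
    // OS can reclaim whenever it needs to.
    public class MmapHeapDemo {
        public static void main(String[] args) throws Exception {
            Path path = Paths.get(args[0]);                     // any large existing file
            long heapBefore = usedHeap();
            try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
                long size = Math.min(ch.size(), Integer.MAX_VALUE);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
                System.out.println("mapped " + buf.capacity() + " bytes, direct=" + buf.isDirect());
            }
            long heapAfter = usedHeap();
            System.out.println("heap delta: " + (heapAfter - heapBefore)
                    + " bytes (just the tiny buffer object, not the file)");
        }

        private static long usedHeap() {
            Runtime rt = Runtime.getRuntime();
            return rt.totalMemory() - rt.freeMemory();
        }
    }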

Maybe that one node has less free RAM than others and it becomes the bottleneck? Maybe the node restart frees up the RAM for a while until processes take it away from the OS?

OK, we removed some settings from the templates, and it had some positive effect (as far as I can see), but nothing changed with the single overloaded nodes.

I tried manually moving shards from the overloaded node to other nodes (with cluster rebalancing disabled beforehand), and bulk throughput increased a lot (as in the GIF for 70M). It seems to repeat with the next day's new indices. So it really looks like the overloaded node is the bottleneck for bulks.

We also tried stopping the data load for some time. We waited until all bulks were processed and then resumed uploading... and the overloaded node came back...

For now, we have stopped loading the main part of the data. All nodes are doing nothing now, as expected, except... the overloaded node. I'm confused... what is this ES node doing? The machine has a load average of 30-40, 30 bulk threads, and 80-100 bulks in the queue. top shows 2000-3000% CPU for the ES process, while the second instance on the same machine is fine and does nothing. The instance itself reports that it uses 80-90% CPU.
Hot threads: http://pastebin.com/Gtfbk2Mg
What does it mean?

Hmm your hot threads output on that "still busy even after stopping bulk indexing" node is unhealthy, with nearly all threads doing this:

  97.7% (976.8ms out of 1s) cpu usage by thread 'elasticsearch[test_data_11-d2][bulk][T#7]'
     10/10 snapshots sharing following 22 elements
       java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(Unknown Source)
       java.lang.ThreadLocal$ThreadLocalMap.remove(Unknown Source)
       java.lang.ThreadLocal$ThreadLocalMap.access$200(Unknown Source)
       java.lang.ThreadLocal.remove(Unknown Source)
       java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(Unknown Source)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(Unknown Source)
       java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(Unknown Source)
       org.elasticsearch.common.util.concurrent.ReleasableLock.close(ReleasableLock.java:49)
       org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:365)
       org.elasticsearch.index.shard.IndexShard.create(IndexShard.java:531)
       org.elasticsearch.index.engine.Engine$Create.execute(Engine.java:810)
       org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:476)
       org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:69)

It's as if you have waaay too many ThreadLocal instances and these threads are stuck walking through all of them pruning the now unreferenced ones.
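
For context, every ThreadLocal a thread has touched gets an entry in that thread's internal ThreadLocalMap; when the ThreadLocal objects themselves are garbage collected, those entries go stale, and the next set()/remove() on that thread has to step over and expunge them. A contrived, JDK-only sketch of the effect (the counts are arbitrary; this is not ES code):

    import java.util.ArrayList;
    import java.util.List;

    // Bloat one thread's ThreadLocalMap with entries, let their ThreadLocal keys be GC'd,
    // then watch set()/remove() spend time in expungeStaleEntry cleaning up stale slots.
    public class StaleThreadLocalDemo {
        public static void main(String[] args) {
            List<ThreadLocal<Object>> locals = new ArrayList<>();
            for (int i = 0; i < 500_000; i++) {
                ThreadLocal<Object> tl = new ThreadLocal<>();
                tl.set(new Object());          // adds an entry to this thread's ThreadLocalMap
                locals.add(tl);
            }
            locals.clear();                    // drop the strong references to the ThreadLocals
            System.gc();                       // the map's weak keys are collected -> stale entries

            ThreadLocal<Object> survivor = new ThreadLocal<>();
            long start = System.nanoTime();
            survivor.set(new Object());        // probing now runs into stale slots...
            survivor.remove();                 // ...and remove() expunges the ones it encounters
            System.out.printf("set()+remove() took %.1f ms%n", (System.nanoTime() - start) / 1e6);
        }
    }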

Which Java version are you using?

Do you still have any settings increasing e.g. thread pool sizes?

Sounds strange...

We use JRE version 1.8.0_66.

We didn't touch any settings, except for removing these from our templates so they fall back to the defaults:

index.merge.policy.floor_segment
index.merge.policy.max_merged_segment
index.merge.scheduler.*
index.translog.flush_threshold_size
index.store.type

So, it seems that node stopped doing these strange things once the new (today's) indices were created, and the cluster worked fine today - up to 90M docs/5 min. But now I again see one node (on another machine) with a permanent queue of ~200 bulks, though it doesn't seem to delay the others for now. All ES nodes have almost the same environment, and I don't see what could be causing this queue.

We'll see what happens over the weekend.

Hmm what settings remain? Do you launch ES with a different number of cores than it would have detected via Java's Runtime APIs?

These hot threads are very bad, a smoking gun.

We need to understand why you're hitting them ... I think it must be either a bug somewhere, or an extreme setting tickling something bad.

My observation is that a lock of class java.util.concurrent.locks.ReentrantReadWriteLock is passed to new ReleasableLock(...), and ReleasableLock uses ThreadLocal.

The JDK's ReentrantReadWriteLock also uses ThreadLocal internally.

I think the getReadHoldCount() issue of ReentrantReadWriteLock (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6625724) should not be underestimated. Quoting:

The maintainers of ReentrantReadWriteLock now regret having added this extra overhead. Unfortunately, we cannot withdraw support for this, since we added the public getReadHoldCount method.

It could be worth replacing java.util.concurrent.locks.ReentrantReadWriteLock with a customized class that does not make use of ThreadLocal. Maybe ReleasableLock should be reorganized at the same time, too, since it looks like a wrapper around java.util.concurrent.locks.ReentrantReadWriteLock.
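
A minimal JDK-only sketch of that internal ThreadLocal path: each ReentrantReadWriteLock keeps a per-thread read-hold counter in its own ThreadLocal, and once a thread is not the lock's "first reader", every read unlock ends in the ThreadLocal.remove() frame seen in the hot threads output above:

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Forces the worker thread off the read lock's "first reader" fast path, so each of its
    // lock()/unlock() cycles goes through the lock's internal ThreadLocal hold counter and
    // ends in ThreadLocal.remove() (the tryReleaseShared frames from the hot threads).
    public class ReadLockThreadLocalDemo {
        public static void main(String[] args) throws Exception {
            ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
            lock.readLock().lock();            // main thread becomes the "first reader"
            Thread worker = new Thread(() -> {
                for (int i = 0; i < 1_000_000; i++) {
                    lock.readLock().lock();    // per-thread hold count tracked via ThreadLocal
                    lock.readLock().unlock();  // hold count hits zero -> readHolds.remove()
                }
            });
            worker.start();
            worker.join();
            lock.readLock().unlock();
            System.out.println("worker exercised the ThreadLocal-backed read hold counter");
        }
    }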

Are you running Java with assertions enabled for Elasticsearch classes? The ReleasableLock only uses ThreadLocal if assertions are enabled.

That said, I don't think either this or the JVM bug @jprante cited ought to matter here (and I would rather not fork JDK classes into ES), because we limit the number of unique threads based on the bulk and index thread pools, which ought to limit the size of the backing map used by ThreadLocal.

@ivank would it be possible to capture a heap dump of the "node still busy after bulk indexing stopped" while its hot threads are reclaiming ThreadLocals? This could help understand why that backing map has so many entries...
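
(For anyone following along: the "only if assertions are enabled" part works because the tracking call sits inside an assert statement, so it never runs unless the JVM is started with -ea. A generic sketch of that pattern, not the actual ReleasableLock source:)

    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    // Generic assert-guarded ThreadLocal: the boolean helpers are only invoked when the
    // JVM runs with -ea, so without assertions this class never touches the ThreadLocal.
    public class AssertGuardedLock {
        private final Lock lock = new ReentrantLock();
        private final ThreadLocal<Boolean> held = ThreadLocal.withInitial(() -> false);

        public void acquire() {
            lock.lock();
            assert markHeld();                 // skipped entirely without -ea
        }

        public void release() {
            assert markReleased();             // likewise skipped without -ea
            lock.unlock();
        }

        private boolean markHeld()     { held.set(true);  return true; }
        private boolean markReleased() { held.set(false); return true; }
    }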

We specified 8 cores (which, as it turned out, is wrong):

os.processors: 8

but that's fewer cores than the machine has, not more...

Runtime.availableProcessors() returns expected 32 cores.
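
A quick way to see what each instance detects on its own (pure JDK, no ES involved); ES sizes its fixed thread pools (bulk, index, ...) from this count unless the processors setting overrides it:

    // Prints the processor count the JVM reports. With two ES instances on one 32-core box
    // and no working override, each instance sizes its thread pools as if it owned all 32 cores.
    public class DetectedProcessors {
        public static void main(String[] args) {
            System.out.println("availableProcessors() = " + Runtime.getRuntime().availableProcessors());
        }
    }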

The ReleasableLock only uses ThreadLocal if assertions are enabled.

We don't use -ea at all.

a heap dump of the "node still busy after bulk indexing stopped"

First, we need to see such a node again, but we don't have one at the moment.
Second, I don't think we can transfer such a big dump...

We are trying to reproduce the overloaded-node problem (the one that had no ThreadLocalMap.expungeStaleEntry entries, just normal bulk) on our small cluster (4 data nodes, 4 masters), but we can't :frowning: And I don't think we can reproduce ThreadLocalMap.expungeStaleEntry at all.
If we could, we would take a dump of that instance, of course.

Hmm, I see we made a mistake; there is just a processors setting, right?
If so, did each of the two instances use 32 processors by default instead of 16? Or do they detect each other on the same machine?

Yes, it's just processors, and I don't think the nodes will detect they are sharing the same cpu(s) and divide the processors in half. So, yes, it sounds like you had two nodes on one box, each thinking it had 32 cores to work with.

But I would be surprised if this led to the hot threads you saw (trying to clear out ThreadLocal instances from expired threads). It seems like something is making an excessive number of ThreadLocal instances somewhere...

So, we have reduced the number of processors available to ES, increased the shard count for large indices, and set some template settings back to the defaults - all of this had a positive effect, but one overloaded node still remains, and because of it, cluster performance slows down over time... We also tried using the node client to send bulks directly to the shards.
Initially, nodes became overloaded in turn, but now we have had the same one all day. hot_threads on the overloaded node now shows the stack trace above, with java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry in almost all snapshots of the bulk threads. I have taken a heap dump and started jhat on it to view it in the browser (we can't extract the dump because it is on a remote machine).
So, I have looked at the tables in the ThreadLocalMaps, and they have sizes of 2^14 and 2^15. The tables are sparse, but some runs are big (up to 4000 elements). The non-null elements are WeakReferences to null (the same WeakReference object appears at many other indices). I also saw one weak reference to a HashMap, but I think most are WeakReferences with nulls. What does all this mean?
What can I post here, and where should I look in the dump?
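
(If it helps with interpretation: the WeakReference-to-null entries are stale slots whose ThreadLocal keys have already been collected. If we could run code inside the JVM, the same numbers could be pulled live with a reflection hack like the sketch below; it relies on Java 8 internal field names, so treat it as a throwaway diagnostic.)

    import java.lang.ref.Reference;
    import java.lang.reflect.Field;

    // Throwaway diagnostic: for every live thread, count occupied ThreadLocalMap slots and
    // how many are stale (ThreadLocal key already GC'd) - the WeakReference-to-null entries.
    public class ThreadLocalMapStats {
        public static void main(String[] args) throws Exception {
            Field mapField = Thread.class.getDeclaredField("threadLocals");
            mapField.setAccessible(true);
            for (Thread t : Thread.getAllStackTraces().keySet()) {
                Object map = mapField.get(t);
                if (map == null) continue;
                Field tableField = map.getClass().getDeclaredField("table");
                tableField.setAccessible(true);
                Object[] table = (Object[]) tableField.get(map);
                int used = 0, stale = 0;
                for (Object entry : table) {
                    if (entry == null) continue;
                    used++;
                    if (((Reference<?>) entry).get() == null) stale++;   // key collected -> stale
                }
                System.out.printf("%-40s table=%d used=%d stale=%d%n",
                        t.getName(), table.length, used, stale);
            }
        }
    }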

@mikemccand, could you look at this, please?

Hi @ivank,

Can you run jstack <pid> and post the result, on the hung node? This will show all thread names and what they are up to.

Can you also run https://github.com/elastic/elasticsearch-support-diagnostics and post somewhere? This pulls more detailed stats from the cluster.

Hi! Glad to see you again!

Yeah, I did such a stack dump, but it shows a different stack trace than hot threads, and without ThreadLocalMap.expungeStaleEntry on top.
I think I can do that next week, plus we have to modify the output from the provided utility, of course.

Hey, I have been seeing the same issue for a while. Do you suggest opening a new discussion, or can I put the data I have here?
I see the CPU going to 95%+ with hot threads showing:

  66.6% (332.9ms out of 500ms) cpu usage by thread 'elasticsearch[host][search][T#25]'
     10/10 snapshots sharing following 14 elements
       java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:617)
       java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:499)
       java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:298)
       java.lang.ThreadLocal.remove(ThreadLocal.java:222)

If I increase the refresh interval from 30s to 60s, the CPU seems to come down, but when I revert it, the CPU on the host starts increasing and stays there until I send a
sudo kill -QUIT {pid}
and everything comes back to normal.

This issue resolved itself when I added client nodes to handle the routing logic and disabled HTTP on the data nodes. But we still have some hosts stuck in the above state until I go and do a sudo kill -QUIT.

@Arun_Agarwal hi, you are welcome to share your case and details. Which ES version do you use?

@mikemccand For now we have found a temporary solution - restarting the stalled ES instance. This solves the problem for a while (6-24 hours), and then the issue occurs again. Other changes we've tried:

  1. increased the shard count (this increased throughput a little)
  2. replaced the transport client with the node client (reduces network overhead)
  3. changed refresh_interval to 60s, but this does not change things.

A little later, @ivank will share the jstack output and support diagnostics; we have trouble with an NDA, so we need to change sensitive data (IPs, index and field names), and this slows things down a lot.