Double CPU usage after 1.4.5 to 2.0


(John Veldboom) #1

Is anyone else experiencing higher CPU usage after upgrading to 2.0?

We were planing to migrate to 2.0 over the weekend but our initial tests show 2.0 is using at least twice the CPU - sometimes much more - than our nodes running 1.4.5. It just does not seem right that the test 2.0 node with around 5,000 documents is using more resources than our 1.4.5 with around 4 million documents. They're both on the same AWS instance type.

Any help in pointing us in the right direction would be greatly appreciated. Thanks!


(Zachary Tong) #2

What sort of tests are you running? Indexing? Searching? Both at the same time?

Could you check the hot_threads output and attach it here?


(John Veldboom) #3

We're currently running just indexing tests on a single node. There's a backend process that deletes the index, recreates it and then inserts around 1,000 documents. No searching since it's on testing server.

Here's the output from the localhost:9200/_nodes/hot_threads

::: {node-1}{2ahNk7oWTfO-ewGrVvVj2g}{10.0.1.44}{10.0.1.44:9300}
  Hot threads at 2015-11-20T15:19:48.988Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

0.0% (131.5micros out of 500ms) cpu usage by thread 'elasticsearch[node-1][transport_client_timer][T#1]{Hashed wheel timer #1}'
 10/10 snapshots sharing following 5 elements
   java.lang.Thread.sleep(Native Method)
   org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
   org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
   org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   java.lang.Thread.run(Thread.java:745)

Also here's the CPU usage on 1.4.5 and below on 2.0. Both running the same 15 minute index job mentioned above. I had to just create the 2.0 instance so there's no as many data points but you can still see the spikes.

1.4.5

2.0


(Zachary Tong) #4

If you could gist up the whole hot_threads, that would be helpful. It doesn't sort them by most activity, more like a big jumble of threads, so I'd have to skim the whole thing.

Since you're indexing, if I had to guess, I'd say it is due to doc values. In v2.0, doc values were enabled by default for all not_analyzed fields. Doc values are essentially a serialized column store that replaces field data. The benefit is that going OOM is no longer a problem when doing most aggregations. The cost is paid at indexing time though, because doc values require extra processing/compression/serializing during the segment creation process.


(John Veldboom) #5

That was the whole hot_threads output from curl localhost:9200/_nodes/hot_threads Should I be looking somewhere else?

The index has a lot of not_analyed fields (like 130 out of 150) so that would explain the indexing spikes, correct? And I'm assuming the load would level out after the creation process.

After reading the documentation, it looks like it might be beneficial to not use doc values on this index since it's always going to be small and it's being recreated every 15 minutes. But I do love the idea of it being on by default for the other larger ones - I'll trade speed for stability.


(Zachary Tong) #6

Ah, must have just hit hot_threads at a slow time. It takes a snapshot of the currently running threads, so sometimes you can get unlucky and just hit it at an "idle" snapshot.

Given the number of fields going into doc values now, I think that's probably what we're seeing. Once you stop indexing, the CPU behavior should settle back down to idle (once the pending merges finish). The big spikes are likely a few large merges happening, while the lower background noise are smaller merges.

Note: the portions of the Definitive Guide that talk about Doc Values are relatively old...we wrote that way back at 1.0 when doc values were still pretty new and unoptimized. They've consistently improved since then, so I wouldn't worry about the overhead too much to be honest.

If you know this index will always be small and replaced consistently, you might consider disabling doc values...just with the caveat that the behavior might not last forever (e.g. when 3.0 rolls around). I believe the plans are to completely remove fielddata completely, so if you want aggregations you'll have to use them regardless.

But if some of them are never aggregated on at all (just for search), that'd be an easy win. And since the index is replaced frequently, you can always re-enable it in the future for those fields.


(Zachary Tong) #7

Oh, also another note, ES auto-tunes merge behavior in 2.0 now. So some of those spikes may be because Lucene noticed the system was essentially idle and decided to let the merges have the full run of the IO (which would increase CPU usage too). In Lucene's eyes, idle resources are there to be consumed, so it's letting itself get things done faster =)


(John Veldboom) #8

Man thank you for your help on this! This appears to be exactly what's going on. I ran a simple file import using the default mapping with the doc value and then again without ("doc_values" : false). Left highlighted is with and right is without.

Almost exactly 2-3x CPU usage which is what I was seeing before - love reproducible issues and fixes!

Thanks for the note on the auto-tune behavior as well.


(Zachary Tong) #9

Np, happy to help!

Just a friendly reminder that idle resources are there to be utilized, and Elasticsearch is rarely CPU bound. I wouldn't prioritize CPU usage over functionality unless you are actually hitting a bottleneck. =)


(John Veldboom) #10

I do appreciate your help but I'm not convinced this issue is totally resolved. I woke up this morning to an unresponsive server. CPU and memory were all maxed out. Along with extremely high disk i/o.

More than happy to help look into this further to help improve ES though. For now we are just going to roll back to 1.4.5 since it's been solid as a rock for us.


(Zachary Tong) #11

Was this before or after you disabled doc values? Do you aggregate on those fields heavily? If it was after disabling doc values (e.g. back to field data) and you aggregate on them, you may just have blown out your memory from aggregations.

Any logs from the incident? Logs + Hot threads would help, as well as Node Stats output.


(John Veldboom) #12

I did disable the doc values on the one index that gets deleted and recreated every 15 minutes. That one seemed pretty stable.

What caused the server to completely go down was the other indexes I restored from a snapshot. (snapshot from v 1.4.5) The index is sales data shared by month over the past 2 years - roughly 150k records per month.

Restoring those indexes seemed to go fine and the server's load was fine until the nightly update which adds the sales data from that day. For Friday it was just over 4,300 records added. But that update caused the system to crash and never fully recover.

It was Saturday and I really did not want to spend my day off researching the issue. So I killed the server and restored a backup form 1.4.5. But I'll see if I can recreate the issue locally and get some logs and other output to help debug this.

Really want to move to 2.0 to take advantage of those new pipeline aggregations. Thanks again for your help with this.


(Zachary Tong) #13

Hm, ok... I don't have any guesses at this point to be honest. Did you happen to run the the migration plugin on the "old" snapshotted data to make sure it didn't have any upgrade problems? Perhaps there was an existing upgrade problem in the old data (conflicting mapping or something) and a bug which allowed the Restore to proceed anyway? Dunno.

Will wait to see if you can recreate it and/or grab logs from the event, plus I'll ping a few folks internally to see if the Restore thing could be a problem.

EDIT: Just asked a colleague, it's unlikely a problem with the Restore, since that uses the same mechanism we use to check shard versions when recovering or loading dangling indices. Sooo I'm fresh out of ideas, will keep thinking


(Monotek) #14

Maybe you're running in the same problem as me after update to 2.0: Elasticsearch crashes after update to 2.0


(system) #15