Sluggishness after 1.7 -> 2.4 update

We recently updated to 2.4, and we're seeing some sluggish performance. It also seems like Elastic is underutilizing available memory.

It's worth noting that our simple test environment, a single-node AWS instance running 2.3 under Docker with 16GB of its 32GB total RAM allocated to Elasticsearch, seemed to have more consistent performance.

The behavior is that we'll run a number of similar searches and everything seems performant, but when we switch up the search, say to a different geographic area, or change the keywords driving the search, it sometimes gets laggy. To me that suggests it's not caching enough in memory (only about 10-15% of max on the data nodes at any given time).

We have 4 data-only nodes with 64GB RAM each, running RHEL 7.1 with RAID'ed SSDs. 31GB is allocated to Elasticsearch, but it never seems to go beyond using around 4GB per data node. Our usage isn't particularly high, although it's currently higher than normal. I would expect to see more of the available memory in use.
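In case it helps anyone reproduce the numbers, here's a rough sketch of how the per-node heap and cache usage can be pulled from the stats API (this assumes the HTTP endpoint is reachable on localhost:9200 and uses the Python requests library; adjust the host for your setup):

```python
import requests

# Pull JVM heap plus the caches that back repeated searches from each node
# (assumes the cluster's HTTP endpoint is reachable on localhost:9200).
stats = requests.get("http://localhost:9200/_nodes/stats/jvm,indices").json()

for node_id, node in stats["nodes"].items():
    heap_used = node["jvm"]["mem"]["heap_used_in_bytes"]
    heap_max = node["jvm"]["mem"]["heap_max_in_bytes"]
    fielddata = node["indices"]["fielddata"]["memory_size_in_bytes"]
    query_cache = node["indices"]["query_cache"]["memory_size_in_bytes"]
    print(
        f"{node['name']}: heap {heap_used / 2**30:.1f}/{heap_max / 2**30:.1f} GB, "
        f"fielddata {fielddata / 2**20:.0f} MB, query cache {query_cache / 2**20:.0f} MB"
    )
```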

One complication is that we also moved to Docker, in part to give us a quick rollback option. We have 1.7 installed natively on our data nodes, but it's disabled now. All indexes were copied on the filesystem for the new 2.4 instances to use, so if we needed to roll back quickly to 1.7 we still had/have those indexes in place (though at this point they're 3 days behind).

The Marvel agent seems to be indexing pretty frequently; we are seeing an index rate of around 50 docs/s on the .marvel-es-2016.09.13 index at any given time, FWIW.
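If it's useful, something like this (again just a sketch against localhost:9200) shows how much the Marvel indices are actually accumulating:

```python
import requests

# List the Marvel indices with document counts and on-disk size in MB
# (assumes the cluster's HTTP endpoint is reachable on localhost:9200).
resp = requests.get(
    "http://localhost:9200/_cat/indices/.marvel-*",
    params={"v": "true", "h": "index,docs.count,store.size", "bytes": "mb"},
)
print(resp.text)
```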

We're thinking that, since we never had to roll back, we might abandon docker and run 2.4 directly on the machine to see if it makes a difference. But I'd prefer to figure this out if possible.

Things we have been looking at:

  1. Memory swapping seems fine; that is, the Docker container does not appear to be swapping, since it inherits our vm.swappiness=1 from the host machine
  2. The 2.4 data nodes are definitely getting their 31GB allocated to the JVM (see the sketch after this list)
  3. When we point our transport client at our data nodes instead of a dedicated client node, we get anecdotally better performance. We're looking to put a beefier client node (or nodes) in place, but it doesn't seem like it will totally solve the issue
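For completeness, here's a sketch of how items 1 and 2 can be double-checked from the API side (same localhost:9200 assumption; note that process.mlockall only reports true when bootstrap.mlockall is enabled):

```python
import requests

# Confirm the heap ceiling each node actually started with and whether the
# JVM has locked its memory (process.mlockall is true only when
# bootstrap.mlockall is enabled).
info = requests.get("http://localhost:9200/_nodes/process,jvm").json()

for node_id, node in info["nodes"].items():
    mlockall = node["process"]["mlockall"]
    heap_max = node["jvm"]["mem"]["heap_max_in_bytes"]
    print(f"{node['name']}: mlockall={mlockall}, heap_max={heap_max / 2**30:.0f} GB")
```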

We're going to spin up our AWS test instance again to get something more than anecdotal numbers, but we're on a deadline (aren't we all :slight_smile:), so I was hoping that tossing this out to the field might yield something obvious we're missing...

To add:

  1. The primary index in use was recreated and reindexed following the upgrade; it's not a holdover from 1.7 (the design is, but the index itself was rebuilt)
  2. The index is about 523GB, spread across 24 total shards (12 primary, 12 replica), 6 shards per node; that's around 22GB/shard (see the sketch below)
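The sketch referenced above, for confirming per-shard size and placement (the index name here is a placeholder):

```python
import requests

# Per-shard size and placement for the primary index
# ("main_index" is a placeholder; substitute the real index name).
resp = requests.get(
    "http://localhost:9200/_cat/shards/main_index",
    params={"v": "true", "h": "index,shard,prirep,node,store", "bytes": "gb"},
)
print(resp.text)
```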

It does seem pretty suspicious that ES is only using 4GB memory. Is that heap utilization or overall memory?
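One rough way to see both at once (a sketch, assuming the API is reachable on localhost:9200) is to let _cat/nodes print heap usage next to overall RAM usage:

```python
import requests

# _cat/nodes shows JVM heap usage and overall OS RAM usage side by side,
# which separates heap utilization from overall memory use.
resp = requests.get(
    "http://localhost:9200/_cat/nodes",
    params={"v": "true", "h": "name,heap.current,heap.percent,heap.max,ram.percent"},
)
print(resp.text)
```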

FYI, we've been running ES in production in Docker for about 3 years and have never experienced this type of issue. My first thought was that maybe you had a memory limit set on the container, but if you're seeing the full 31GB heap available then that wouldn't be the case. We haven't yet tested 2.4, though, so I guess there could be something there.

Kimbro

Yeah, we had spun up some Docker images previously to work out the kinks with heap and other settings.

Swappiness seems to be inherited, so I don't know if it's that.

Is an Elasticsearch 2.4 index compatible with 2.3?

I'd be surprised if you can go backwards from 2.4 to 2.3, but I don't know for certain.

OK, one other possible point of investigation that we didn't think of until afterwards...

For the master + client nodes, which are already VMs (not ideal, but it works), we aren't using Docker (running Docker inside a VM doesn't seem to make sense), and those have always run Oracle/Sun's JDK (Java 8).

For the data nodes, since we're using a slightly modified version of the standard Dockerfile, they're running OpenJDK 8.
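A quick way to confirm which JVM each node is actually running (same sketch-level assumption of localhost:9200):

```python
import requests

# Report the JVM vendor, VM name, and version per node, to spot any
# Oracle vs OpenJDK mismatch between the VM nodes and the Docker data nodes.
info = requests.get("http://localhost:9200/_nodes/jvm").json()

for node_id, node in info["nodes"].items():
    jvm = node["jvm"]
    print(f"{node['name']}: {jvm['vm_vendor']} {jvm['vm_name']} {jvm['version']}")
```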

That's probably something we need to rectify. Not sure it explains our issues.