We recently upgraded to 2.4, and we're seeing some sluggish performance. It also seems like Elasticsearch is underutilizing the available memory.
It's worth noting that our simple test environment, a single-node AWS instance running 2.3 under Docker with 16 GB of its 32 GB of RAM allocated to Elasticsearch, seemed to have more consistent performance.
The behavior is this: we run a number of similar searches and everything seems performant, but when we switch the search up, say to a different geographic area, or change the keywords driving the search, it sometimes gets laggy. To me that looks like not enough is being cached in memory (heap usage is only about 10-15% of max on the data nodes at any given time).
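To put numbers on the caching hunch, this is roughly the check I have in mind: pull per-node cache sizes and evictions from the node stats API. A minimal sketch in Python with requests, assuming the HTTP API is reachable on localhost:9200 (adjust the host for your own nodes):

```python
# Minimal sketch: dump query cache / request cache / fielddata usage per node.
# Assumes the cluster HTTP API is reachable at ES_HOST.
import requests

ES_HOST = "http://localhost:9200"

stats = requests.get(ES_HOST + "/_nodes/stats/indices").json()

for node_id, node in stats["nodes"].items():
    indices = node["indices"]
    print(node["name"])
    print("  query cache:   %d bytes, %d evictions"
          % (indices["query_cache"]["memory_size_in_bytes"],
             indices["query_cache"]["evictions"]))
    print("  request cache: %d bytes, %d evictions"
          % (indices["request_cache"]["memory_size_in_bytes"],
             indices["request_cache"]["evictions"]))
    print("  fielddata:     %d bytes, %d evictions"
          % (indices["fielddata"]["memory_size_in_bytes"],
             indices["fielddata"]["evictions"]))
```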
We have 4 data-only nodes with 64 GB RAM each, running RHEL 7.1 with RAID'ed SSDs. 31 GB is allocated to the Elasticsearch heap, but usage never seems to go beyond around 4 GB per data node. Our load isn't particularly high, but it is higher than normal, and I would expect to see more of the available memory in use.
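For reference, the "around 4 GB" figure comes from watching heap used vs. heap max per node, along these lines (again a sketch, assuming localhost:9200 is a reachable node):

```python
# Minimal sketch: report JVM heap used vs. max for every node in the cluster.
import requests

ES_HOST = "http://localhost:9200"

jvm = requests.get(ES_HOST + "/_nodes/stats/jvm").json()

for node_id, node in jvm["nodes"].items():
    mem = node["jvm"]["mem"]
    print("%s: %.1f GB used of %.1f GB max (%d%%)"
          % (node["name"],
             mem["heap_used_in_bytes"] / (1024.0 ** 3),
             mem["heap_max_in_bytes"] / (1024.0 ** 3),
             mem["heap_used_percent"]))
```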
One complication is that we also moved to Docker, in part to give us a quick rollback option. We have 1.7 installed natively on our data nodes, but it's disabled now. All indexes were copied on the filesystem for the new 2.4 instances to use, so if we needed to roll back quickly to 1.7 we'd still have those indexes in place (though at this point they're 3 days behind).
The Marvel agent seems to be indexing pretty frequently: we're seeing an index rate of around 50 docs/s on the .marvel-es-2016.09.13 index at any given time, FWIW.
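The 50 docs/s number is from eyeballing Marvel itself; a cruder but independent way to measure it is to take the doc count of the day's Marvel index twice and divide by the interval. A sketch of that, assuming the same localhost:9200 endpoint and today's index name:

```python
# Minimal sketch: estimate the Marvel index rate from a doc-count delta.
import time
import requests

ES_HOST = "http://localhost:9200"
MARVEL_INDEX = ".marvel-es-2016.09.13"  # adjust to the current day's index

def doc_count():
    return requests.get("%s/%s/_count" % (ES_HOST, MARVEL_INDEX)).json()["count"]

before = doc_count()
time.sleep(60)
after = doc_count()
print("%.1f docs/s" % ((after - before) / 60.0))
```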
We're thinking that, since we never had to roll back, we might abandon Docker and run 2.4 directly on the machines to see if it makes a difference. But I'd prefer to figure this out if possible.
Things we have been looking at:
- Memory swapping seems fine; that is, the Docker container doesn't appear to be swapping, and it inherits our vm.swappiness=1 from the host machine (see the sketch after this list for the double-check we're planning)
- The 2.4 data nodes are definitely getting their 31 GB allocated to the JVM
- When we point our transport client at our data nodes instead of a dedicated client node, we get anecdotally better performance. We're looking to put a beefier client node (or nodes) in place, but it doesn't seem like that will totally solve the issue
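On the swapping point, the node info API reports whether mlockall succeeded for each JVM, which is a quick way to confirm the containers can actually lock their heap. A sketch, assuming the same localhost:9200 endpoint (mlockall only shows as true if bootstrap.mlockall is enabled in your config):

```python
# Minimal sketch: check whether memory locking (mlockall) is active per node.
import requests

ES_HOST = "http://localhost:9200"

info = requests.get(ES_HOST + "/_nodes/process").json()

for node_id, node in info["nodes"].items():
    print("%s: mlockall=%s" % (node["name"], node["process"]["mlockall"]))
```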
We're going to spin up our AWS instance again to get something more than anecdotal numbers, but we're on a deadline (aren't we all), so I was hoping tossing this out to the field might yield something obvious we're missing...