Looking for help or hints to solve an enormous performance degradation after upgrading ES to 5.3.1
Long story but trying to be as complete as possible in the first post.
I know this question has been asked before, but none of the answers were helpful to us.
The hardware setup
Java 1.8.0_131 and 1.8.0_121
20 data nodes
1 dedicated master node without data (the active one), 3 other master-eligible nodes that contain data.
2 dedicated client nodes (data=false, master=false)
70 GB memory and 600 GB SSD disks per node; 29 GB JVM heap on all nodes, which is below the compressed oops threshold (Compressed Oops mode: Zero based)
Nodes are virtual machines, 4 virtual machines per physical server.
Two indexes, both with roughly 800 million documents (using parent-child and nested documents).
We were running on ES 1.4.5 on the above hardware.
This setup resulted in an average response time of less than a second.
Query load per day is roughly:
50,000 relatively complex has_child queries, hitting both indexes and every shard in the cluster, using filters (terms) on the parent records
150,000 relatively complex queries on records selected by various terms filters
200,000 simple queries that are routed to a specific shard using routing and parent_id.
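To make the two heavier query shapes concrete, here is a minimal sketch of a has_child query with terms filters on the parent, plus the routing parameter used for the simple queries. The type name, field names and values are hypothetical placeholders, not our real mapping, and the helper functions are only for illustration.

```python
import json


def has_child_query(child_type, child_query, parent_terms):
    """Build a search body: has_child clause plus terms filters on the parent."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"has_child": {"type": child_type, "query": child_query}}
                ],
                # Filter clauses restrict the parent records without scoring.
                "filter": [
                    {"terms": {field: values}}
                    for field, values in parent_terms.items()
                ],
            }
        }
    }


def routed_search_params(parent_id):
    """URL parameters that send a search only to the shard holding one parent."""
    return {"routing": parent_id}


# Hypothetical child type, field names and values.
body = has_child_query(
    "order_line",
    {"match": {"description": "widget"}},
    {"status": ["open", "pending"]},
)
print(json.dumps(body, indent=2))
print(routed_search_params("12345"))
```

Without a routing value the first body fans out to every shard of both indexes, which is the case that got slow for us.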
The problem with the 1.4.5 setup was that during the nightly indexing job (on average 500,000+ records deleted and 500,000+ records added) query response times sometimes went up to more than 20 seconds and queries timed out.
So we decided to go to ES 5.3, changed mapping, query code etc. to meet all the requirements.
Data was ingested on 8 nodes. Testing on these 8 nodes was acceptable. We then did a "real world" test, switching production between the two setups. Performance degraded a bit, but we decided that this was due to having fewer hardware resources.
After this initial test we did two things.
- We installed a basic license for x-pack (the test was done with the trial license in place)
- We added the other nodes to the cluster, including the dedicated master etc.
After those two steps and going live, performance degraded to the current state: query response is 8 seconds or more on the has_child queries that used to complete within a second, and 4 seconds or more on other queries that hit all the shards and used to complete in 0.5 seconds.
Queries that are routed to a specific shard don't seem to have any problems.
We tried various things to resolve this:
- changed query code to avoid filters, no change
- changed query code to use more filters, no change
- changed query code to use, or not use, a bool query in the filter
- removed aggregations, no change
- added the timeout to the query code, but as the documentation says that is sort of hit and miss; no real change except for incomplete results.
- tried the "index.queries.cache.everything" : true parameter, no change
- tried to reduce the number of nodes, no change
- did a trace on network traffic but no obvious problems there
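For the timeout attempt in the list above, this is roughly what we did, as a hedged sketch: the request body gets a "timeout" value and the response is checked for the "timed_out" flag and failed shards. The sample response is made up for illustration, and the helper names are mine.

```python
def with_timeout(body, timeout="2s"):
    """Return a copy of a search body with a search timeout added."""
    out = dict(body)
    out["timeout"] = timeout
    return out


def is_partial(response):
    """True if a search response may be incomplete (timed out or failed shards)."""
    shards = response.get("_shards", {})
    return bool(response.get("timed_out")) or shards.get("failed", 0) > 0


request = with_timeout({"query": {"match_all": {}}}, "2s")

# Hypothetical response, shaped like a search response that hit the timeout.
sample_response = {
    "took": 2004,
    "timed_out": True,
    "_shards": {"total": 40, "successful": 40, "failed": 0},
    "hits": {"total": 1234, "hits": []},
}
print(request["timeout"], is_partial(sample_response))
```

The timeout is best effort per shard, so a response flagged this way contains whatever had been collected so far, which matches the incomplete results we saw.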
Looking at the stats I see that query_cache and request_cache are hardly used and/or suffer from a lot of evictions.
I tried the search profiler on some queries. Overall performance is low, but it looks as if some shards are consistently worse than others (fastest shard 0.3 seconds, slowest 3.3 seconds). I did not run enough tests to be sure.
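A small sketch of how the cache numbers can be summarised, assuming the query_cache and request_cache sections of the index stats expose hit_count, miss_count and evictions counters. The counter values below are invented for illustration and are not our real stats.

```python
def cache_summary(section):
    """Summarise one cache section (query_cache or request_cache) from _stats."""
    hits = section.get("hit_count", 0)
    misses = section.get("miss_count", 0)
    lookups = hits + misses
    return {
        # Fraction of lookups served from the cache; 0.0 when it was never used.
        "hit_ratio": hits / lookups if lookups else 0.0,
        "evictions": section.get("evictions", 0),
    }


# Hypothetical counters, not taken from our cluster.
sample_query_cache = {"hit_count": 50, "miss_count": 950, "evictions": 400}
summary = cache_summary(sample_query_cache)
print(summary)
```

A low hit ratio combined with a high eviction count is what I mean by "hardly used and/or a lot of evictions".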
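To check whether the same shards are slow every time, repeated profiler runs could be reduced to a median per shard, as in this sketch. It assumes the per-shard times have already been extracted from the profile output into (shard id, seconds) pairs; the shard ids and timings below are hypothetical.

```python
from collections import defaultdict
from statistics import median


def slowest_shards(samples, top=3):
    """Rank shards by median time over several runs, slowest first."""
    by_shard = defaultdict(list)
    for shard_id, seconds in samples:
        by_shard[shard_id].append(seconds)
    medians = {shard: median(times) for shard, times in by_shard.items()}
    return sorted(medians.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Hypothetical timings from two profiler runs.
samples = [
    ("shard-0", 0.3), ("shard-7", 3.3), ("shard-3", 1.0),
    ("shard-0", 0.4), ("shard-7", 3.1), ("shard-3", 1.2),
]
ranking = slowest_shards(samples)
for shard, seconds in ranking:
    print(shard, round(seconds, 2))
```

If the same shard ids stay at the top of the ranking across many runs, that would point at specific shards or the nodes hosting them rather than at the query itself.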
Added some info from index/_stats below