Query performance degradation after upgrade to 5.3.1 (from 1.4.5)


(L.G. Mulder) #1

Hello,

Looking for help or hints to solve an enormous performance degradation after upgrading ES to 5.3.1

Long story but trying to be as complete as possible in the first post.
I know this question has been asked before, but none of the answers were helpful to us.

The hardware setup
Ubuntu 16.04.2
java 1.8.0_131 and 1.8.0_121

23 nodes
20 data nodes
1 dedicated master node without data (the active one), 3 other master-eligible nodes that contain data.
2 dedicated client nodes (data=false, master=false)
70 GB memory and 600 GB SSD disks per node; 29 GB JVM heap on all nodes, below the compressed oops threshold (Compressed Oops mode: Zero based)
Nodes are virtual machines, 4 virtual machines on one server.
Two indexes, both with roughly 800 million documents (using parent/child and nested documents).

The situation
We were running on ES 1.4.5 on the above hardware.
This setup resulted in an average response time of less than a second.
Query load per day is roughly:
50,000 relatively complex has_child queries, hitting both indexes and every shard in the cluster, using filters (terms) on the parent records
150,000 relatively complex queries on records specified by various terms filters
200,000 simple queries that are routed to a specific shard using routing and parent_id.
The problem with the 1.4.5 setup was that during the nightly indexing job (on average 500,000+ records deleted and 500,000+ records added), query response times sometimes went up to more than 20 seconds and queries timed out.
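
For readers who want a concrete picture of the first query type, here is a minimal sketch of a has_child search body with terms filters on the parent, in the ES 5.x query DSL. The child type, field names and values are hypothetical; the real mapping and queries are not shown in this thread.

```python
import json


def build_has_child_query(child_type, child_match, parent_filters):
    """Build a search body: terms filters on the parent documents
    plus a has_child clause that matches on the child documents."""
    filters = [
        {"terms": {field: values}}
        for field, values in sorted(parent_filters.items())
    ]
    return {
        "query": {
            "bool": {
                "must": [
                    {"has_child": {"type": child_type,
                                   "query": {"match": child_match}}}
                ],
                # Filter clauses do not contribute to scoring.
                "filter": filters,
            }
        }
    }


body = build_has_child_query(
    child_type="child_doc",                 # hypothetical child type
    child_match={"title": "example"},       # hypothetical field and value
    parent_filters={"status": ["active"],   # hypothetical parent fields
                    "region": ["eu", "us"]},
)
print(json.dumps(body, indent=2))
```

A query of this shape has to run on every shard, because neither the terms filters nor the has_child clause narrows the search to a single routing value.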

So we decided to go to ES 5.3, changed mapping, query code etc. to meet all the requirements.
Data was ingested on 8 nodes. Testing on these 8 nodes was acceptable. We did a "real world" test, switching production between the two setups; performance degraded a bit, but we decided that this was due to fewer hardware resources.
After this initial test we did two things.

  1. We installed a basic license for X-Pack (the test was done with the trial license in place)
  2. We added the other nodes to the cluster, including the dedicated master etc.

After those two steps and going live, performance degraded to the current state: query response is 8 seconds or more on the has_child queries that used to complete within a second, and 4 seconds or more on other queries that hit all the shards and used to complete in 0.5 seconds.
Queries that are routed to a specific shard don't seem to have any problems.
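
As an illustration of the routed case, the `routing` request parameter limits a search to the shard that the routing value hashes to. Below is a small sketch that builds such a URL; the host, index name and parent id are hypothetical.

```python
from urllib.parse import quote, urlencode


def routed_search_url(base_url, index, parent_id):
    """Return a _search URL that is routed by the parent id, so only
    the shard holding that parent and its children is queried."""
    query = urlencode({"routing": parent_id})
    return f"{base_url}/{quote(index)}/_search?{query}"


# Hypothetical host, index and parent id.
url = routed_search_url("http://localhost:9200", "records", "parent-42")
print(url)
```

Because such a request touches one shard instead of all of them, it is not surprising that it behaves differently from the queries that fan out across the cluster.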

We tried various things to resolve this:

  1. changed query code to avoid filters, no change
  2. changed query code to use more filters, no change
  3. changed query code to use, or not use, a Boolean query in the filter
  4. Removed aggregations, no change
  5. added the timeout to the query code, but as the documentation says, that is sort of hit and miss; no real change except for incomplete results.
  6. tried the "index.queries.cache.everything" : true parameter, no change
  7. tried to reduce the number of nodes, no change
  8. did a trace on network traffic but no obvious problems there
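
Regarding item 5 in the list above, here is a minimal sketch of how a search `timeout` can be set in the request body and how a response can be checked for incomplete results. The `timed_out` flag and the `_shards` section are part of the search response; the sample response and the 2s value are made up for illustration.

```python
def with_timeout(body, timeout="2s"):
    """Return a copy of the search body with a timeout added."""
    limited = dict(body)
    limited["timeout"] = timeout
    return limited


def is_partial(response):
    """True if the search timed out or any shard failed, meaning the
    hits may not cover all matching documents."""
    shards = response.get("_shards", {})
    return bool(response.get("timed_out")) or shards.get("failed", 0) > 0


request = with_timeout({"query": {"match_all": {}}})

# Hypothetical response, only the fields needed for the check.
sample_response = {
    "timed_out": True,
    "_shards": {"total": 20, "successful": 20, "failed": 0},
}
print(request["timeout"], is_partial(sample_response))
```

The timeout is best effort: a shard returns what it has collected when the limit is noticed, which matches the "incomplete results" observation.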

Other observations
Looking at the stats, I see that query_cache and request_cache are hardly used and/or suffer from a lot of evictions.
I tried the search profiler on some queries. Overall performance is low, but it looks as if some shards are consistently worse than others (fastest shard 0.3 seconds, slowest 3.3 seconds); I did not run enough tests to be sure.

Added some info from index/_stats below

"query_cache": {
  "memory_size_in_bytes": 0,
  "total_count": 87594337,
  "hit_count": 26258747,
  "miss_count": 61335590,
  "cache_size": 0,
  "cache_count": 187781,
  "evictions": 187781
},
"fielddata": {
  "memory_size_in_bytes": 756706952,
  "evictions": 0
},
"request_cache": {
  "memory_size_in_bytes": 112946403,
  "evictions": 0,
  "hit_count": 263349,
  "miss_count": 757226
},
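
To put the counters above in perspective, the following sketch computes the hit rates from the posted numbers. Only the figures come from the stats; the helper function is illustrative.

```python
def hit_rate(hits, misses):
    """Fraction of lookups served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0


# Numbers copied from the index/_stats output above.
query_cache = {"hit_count": 26258747, "miss_count": 61335590,
               "cache_count": 187781, "evictions": 187781}
request_cache = {"hit_count": 263349, "miss_count": 757226}

qc_rate = hit_rate(query_cache["hit_count"], query_cache["miss_count"])
rc_rate = hit_rate(request_cache["hit_count"], request_cache["miss_count"])
evicted_share = query_cache["evictions"] / query_cache["cache_count"]

print(f"query_cache hit rate:   {qc_rate:.1%}")
print(f"request_cache hit rate: {rc_rate:.1%}")
print(f"query_cache entries evicted: {evicted_share:.0%}")
```

This works out to roughly 30% for the query cache and 26% for the request cache. The eviction count equals cache_count, so every entry that was ever cached has been evicted again, which is consistent with cache_size and memory_size_in_bytes both being 0.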


(Nik Everett) #2

You want three total.

I've rarely found these to be useful. I'm really curious what they buy you.

Ewww. You can't do daily indices because of parent/child, right? That is certainly a drawback.

I wouldn't expect great hit rates on this kind of setup anyway. The request cache works well with similar queries over time ranges.

I wonder if your slowest shards are unbalanced by the parent/child or they are stuck doing some updates required. Can you post a snapshot of hot_threads?
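
For anyone following along, a hot_threads snapshot comes from the nodes hot threads API. Here is a minimal sketch using only the standard library; the host is an assumption, and `threads` and `interval` are shown with their default values.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, threads=3, interval="500ms"):
    """Build the URL for a hot threads snapshot of all nodes."""
    params = urlencode({"threads": threads, "interval": interval})
    return f"{base_url}/_nodes/hot_threads?{params}"


def fetch_hot_threads(base_url, **kwargs):
    """Fetch the plain-text hot threads report (needs a reachable cluster)."""
    with urlopen(hot_threads_url(base_url, **kwargs)) as resp:
        return resp.read().decode("utf-8")


# Hypothetical host; only the URL is built here, nothing is fetched.
snapshot_url = hot_threads_url("http://localhost:9200")
print(snapshot_url)
```

Taking the snapshot while a slow query is running gives the most useful picture, since idle nodes report little.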


(L.G. Mulder) #3

Hello Nik,

I want at least three masters; I was going to tidy things up later.

I only use one client node. I installed an nginx proxy on it in order to control access, and it gives me persistent connections to ES. I can probably do without it, but in the old situation it worked well, and it's also a fully configured node that I can add to the cluster if needed.

Daily updates used to work fine, but we have a steady increase in the number of records. Adding hardware did solve it for some time but ES 5 promised to be more effective, hence the update.

Can you explain why you don't expect great hit rates on this kind of setup?
I can execute queries with size=0 and they do get cached, which makes them really fast, but in our situation that is a useless query because I need the document ids and their content.

The shards that are slow are the larger shards (between 30 and 47 GB), but as far as I can see no other process is running on them.
Hot_threads

Thanks
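
One way to check the shard-size imbalance mentioned above is to sort the output of `_cat/shards` by store size. The sketch below assumes the rows were requested with `format=json&bytes=b&h=index,shard,prirep,store,node`; the sample rows, index names and node names are made up.

```python
def largest_shards(rows, top=3):
    """Return the `top` largest shards by store size in bytes.
    Rows without a store value (e.g. unassigned shards) are skipped."""
    sized = [r for r in rows if r.get("store") not in (None, "")]
    return sorted(sized, key=lambda r: int(r["store"]), reverse=True)[:top]


GB = 1024 ** 3

# Hypothetical rows in the shape of the JSON cat output.
sample_rows = [
    {"index": "idx_a", "shard": "0", "prirep": "p", "store": str(12 * GB), "node": "n1"},
    {"index": "idx_a", "shard": "1", "prirep": "p", "store": str(47 * GB), "node": "n2"},
    {"index": "idx_b", "shard": "0", "prirep": "p", "store": str(30 * GB), "node": "n3"},
    {"index": "idx_b", "shard": "1", "prirep": "r", "store": None, "node": None},
]

for row in largest_shards(sample_rows):
    print(row["index"], row["shard"], row["node"], int(row["store"]) // GB, "GB")
```

Comparing this list with the per-shard times from the profiler would show whether the slow shards are simply the big ones.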


(Nik Everett) #4

This:

The request cache is great when you have something like a dashboard that runs over and over again with the same queries but maybe a sliding time window or something.

Looking at your hot_threads - I'm not really sure what is up. You seem to be spending the majority of your time on points based queries but that takes it squarely out of my area of expertise.


(L.G. Mulder) #5

Thanks for the feedback.

What do you mean by "points based queries"?


(Nik Everett) #6

Numbers of some form.

So this isn't normal for Elasticsearch, but I've no idea how it got this way. Could you post some kind of anonymized reconstruction of it? Something that I can run without getting your entire setup, like a bash script that builds the index and runs a bad query. If you are able to send me the index and a query, I'd be happy to crack it open and try to figure out what is up. You can email me directly at my email address in my GitHub profile (https://github.com/nik9000) or you can post a reply here.

