Possible perf regression between 5.1.2 and 5.3.2?

Hi,

We've just finished a rolling upgrade to bump from 5.1.2 to 5.3.2. We then noticed a visible perf regression in query percentiles:

  • fulltext queries (p75): 33ms => 50ms
  • morelike (p75): 90ms => 150ms
  • comp suggest (p75): 10ms => 15ms

Young GC activity increased while overall heap usage seems to have decreased.

We're still in the exploratory phase of this perf regression but any help would be welcome to narrow down our search scope.

Thanks!

I'm working with @dcausse, I'm adding a few more details:

Cluster-wide perf graphs are on Grafana. Comparing individual nodes is also possible (on the elastic1* nodes).

We upgraded the JDK to 1.8.0u131 (minor upgrade from 1.8.0u121).

Comparing GC logs from before and after the upgrade for one of the servers, I can see:

  • allocation rate is fairly similar (from 1.44Gb/s to 1.35Gb/s)
  • avg young GC duration is similar (35ms)
  • young GC interval decreased from 6.8s to 1.1s

=> we are allocating about the same amount of heap, but collecting it much more often, so spending much more total time in young GC
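
Back-of-the-envelope with the averages above (rounded, so only indicative):

    35ms pause every 6.8s  =>  ~0.5% of wall time spent in young GC (before)
    35ms pause every 1.1s  =>  ~3.2% of wall time spent in young GC (after)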

Digging into this, it looks like we re-aligned our JVM options with the standard elasticsearch defaults. We activated the Concurrent Mark Sweep (CMS) GC, which does some strange things to NewRatio. Our young gen is now 2Gb instead of the previous 10Gb (with a 30Gb heap, the default NewRatio=2 would give a 10Gb young gen). I'm going to try removing that option and see if things go back to normal.
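
To double check what the JVM actually picked, something like this on one node should show the effective young gen sizing (the pgrep pattern is just an example, adjust to how the process is launched):

    # inspect the young generation sizing the JVM ended up with
    jcmd $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) VM.flags \
        | tr ' ' '\n' | grep -E 'NewSize|MaxNewSize|NewRatio'
    # if the 2Gb young gen is confirmed, an explicit -Xmn in jvm.options
    # (instead of relying on NewRatio) is one option to test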

I remember playing with GC settings for Elasticsearch when I was with WMF but I didn't get anywhere as far as speed goes. I imagine those GC settings post-date me. It sounds like someone was able to get a significant bump by experimenting with them. I sure don't think it was me.

It wasn't me either 🙂

We still have some work to do to tune it better, in particular experimenting with heap size. I'm pretty sure that our current 30Gb is way too large, but validating this is going to take some time...
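
A cheap data point to keep an eye on while experimenting is actual heap usage per node, e.g.:

    # per-node heap usage vs. configured max (cat API)
    curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'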

Indeed. Y'all don't use many of the things that take up a ton of heap. On 5.x you also have many more protections against using too much heap.

Would you mind sharing what your queries look like? Also, can you try to isolate the issue, e.g. by disabling highlighting and aggregations and simplifying the query as much as possible, and then adding those features back one by one to see if there is anything in particular that makes your queries slower compared to 5.1.2?

Capturing nodes hot threads while the cluster is under load might also help.
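
For example, something along these lines while the slow queries are running:

    # capture hot threads across the cluster while it is under load
    curl -s 'localhost:9200/_nodes/hot_threads?threads=10' > hot_threads.txt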

Sure, the queries we send are:

  • fulltext search: json (basically a filtered boolean + 2 rescore queries; a rough skeleton is sketched after this list)
  • morelike: json (a simple morelike query + 1 rescore query)
  • comp suggest: sadly our query dump does not work for them; this runs 4 queries against 2 FSTs
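
To give a rough idea of the fulltext shape, the skeleton below uses placeholder fields, boosts and window sizes, not the profiles we actually run:

    # very rough skeleton of the fulltext query (placeholders only)
    curl -s -XPOST 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d '{
      "query": {
        "bool": {
          "filter": [ { "term": { "namespace": 0 } } ],
          "should": [ { "match": { "title": "example search terms" } } ]
        }
      },
      "rescore": [
        { "window_size": 8192,
          "query": { "rescore_query": { "match_phrase": { "title": "example search terms" } } } },
        { "window_size": 2048,
          "query": { "rescore_query": { "function_score": { "query": { "match_all": {} } } },
                     "query_weight": 1.0, "rescore_query_weight": 1.0 } }
      ]
    }'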

Since the problem was detected on the production system, it's hard for me to debug and isolate the query components that may be slower. I'll set up a test environment and run some benchmarks.
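
The rough plan is to replay a captured query with pieces removed and compare the reported took times, along these lines (file names are just examples):

    # same query with and without the rescore block, compare server-side timing
    curl -s -XPOST 'localhost:9200/_search' -H 'Content-Type: application/json' \
         -d @fulltext_query.json | jq .took
    curl -s -XPOST 'localhost:9200/_search' -H 'Content-Type: application/json' \
         -d @fulltext_query_no_rescore.json | jq .took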

The hot threads output does not show anything in particular (mainly indexing threads).

Thanks for your suggestions.
