Digging into this, it looks like we re-aligned our JVM options on the standard elasticsearch. We activated Concurrent Mark and Sweep GC, which does some strange things to NewRatio. Our young gen is now 2Gb instead of the previous 10Gb (the default NewRatio=2, with a total heap of 30Gb). I'm going to try remove that option and see if things go back to normal.
I remember playing with GC settings for Elasticsearch when I was with WMF but I didn't get anywhere as far as speed. I imagine those GC setting post-date me. It sounds like someone was able to get a significant bump by experimenting with them. I sure don't think it was me.
We still have some work to do to tune it better. In particular experimenting with Heap size. I'm pretty sure that our current 30Gb is way too large, but validating this is going to take some time...
Would you mind sharing what your queries look like? Also,can you try to isolate the issue eg. by disabling highlighting, aggregations and simplifying the query as much as possible, and then adding those features back again one by one to try to see if there is anything in particular that makes your queries slower compared to 5.1.2?
Capturing nodes hot threads while the cluster is under load might also help.
morelike: json (a simple morelike query + 1 rescore query)
comp suggest: (sadly our query dump does not work for them) this is running 4 queries on 2 FST)
Since the problem was detected on the production system it's hard for me to debug and isolate query components that may be slower. I'll setup a test environment and run some benchmarks.
The hotthread does not show anything particular (mainly indexing threads).
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.