Performance degradation after ES upgrade (v1.7 -> v5.2)


(Tomasz Elendt) #1

Long story short: last year we tried to migrate from Elasticsearch 1.7 to 2.4, but we noticed quite significant performance degradation (details in this thread). We tried all the suggested changes but were not able to match ES 1.7 performance (for our queries/traffic). That said, we were clearly abusing some ES functionality: we were using a terms filter with a massive number of terms and really complex nested filters. We removed these -- replaced the massive terms filter with a simple boolean document flag and a single term filter, and flattened our "nested structures". After that we decided to try a second time, this time with the latest ES 5.x (5.2.2 at that time). We aligned all our tooling to the new ES APIs (mapping, settings, clients, custom scorer plugin, etc.) and load tested a 5-node cluster with recorded live traffic.

Unfortunately, we still see ES 1.7 performing better (much better) for our workload. We can easily handle 375 req/s with a 5-node ES 1.7 cluster, but we had problems serving 100 req/s with a same-size 5-node ES 5.2 cluster. We also noticed that it takes much longer for ES 5.x to warm up properly and that it's much more sensitive to sudden increases in request rate.

In order to solve the performance issue we tried the following tweaks:

  • We tuned all configuration parameters according to the documentation: we set a proper heap size, etc.
  • We observed a high query cache miss rate (hard to compare to ES 1.7, since it doesn't expose this metric) and wrote a custom caching strategy that replicates our old (hand-tuned) caching. It improved caching, but it didn't improve performance. Our guess is that filter caching is not the bottleneck.
  • We backported the old hybrid store mechanism from Elasticsearch 1.7. No difference.
  • We tried preloading various data into filesystem cache (index.store.preload).
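For reference, this is roughly how we applied the preload setting (the index name and the list of file extensions here are illustrative; `index.store.preload` is a static setting, so it has to be set at index creation time or on a closed index):

```json
PUT /my_index
{
  "settings": {
    "index.store.preload": ["nvd", "dvd", "tim"]
  }
}
```

The extensions select which Lucene file types get loaded into the filesystem cache up front (in this sketch: norms, doc values, and the terms dictionary).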

Most of the system metrics look much better on the 5.x cluster (fewer disk reads, due to the smaller heap and better filesystem cache utilization); only load and latencies look worse.

We see that the most problematic queries for the new ES cluster are ones formed from user input that results in many tokens (in the most extreme case, some clients send double URL-encoded strings, which result in many small tokens -- letters and numbers), but we don't know what the real bottleneck is.


Thanks in advance for any ideas how to improve our ES 5.2 cluster performance.


(Adrien Grand) #2

The queries sample leads to a 404 for me.


(Tomasz Elendt) #3

@jpountz: Sorry, I've just corrected the link.


(Adrien Grand) #4

Are you able to identify which part of the query causes the issue by selectively disabling aggregations and your custom scoring?

If the issue is mostly with running the query, it could be interesting to run the validate API on both versions with rewrite=true to see what the generated Lucene query is, and whether they differ.
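Something along these lines should work on both versions (the index name, field, and query text below are placeholders -- substitute one of your real queries):

```shell
curl -s 'localhost:9200/my_index/_validate/query?rewrite=true&pretty' -d '
{
  "query": {
    "match": { "title": "dondurma gibisin" }
  }
}'
```

The `explanations` section of the response contains the rewritten Lucene query string, which you can then diff between the 1.7 and 5.2 clusters.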


(Tomasz Elendt) #5

@jpountz: Here are the query rewrites.

When it comes to disabling aggregations/filters/custom scoring -- I'll test it today.


(Tomasz Elendt) #6

Removing aggregations/filters/custom scoring (one thing at a time) did not help :frowning:

It's either a query rewrite/execution difference or something wrong with our config (which I think is very unlikely, since we haven't changed much).


#7

Hi,

I was about to start my own topic when I read yours. We are having similar issues with percolator performance. Our original app runs on ES 1.5; our new app will run on ES 5.x. We have somewhat complex queries (nesting, scripts, etc.) and large (also complex) documents. But both queries and documents are the same (or nearly the same) on 1.5 and 5.3. Both ES instances run on AWS.

The problem is that the 1.5 percolator is 3-4 times faster than the 5.3 one. What's more, beefing up hardware doesn't help much: a 5.3 percolator running on 2 cores and 7.5 GB is only 5-10% slower than the same one running on 16 cores and 60 GB. We follow all of ES's recommendations for JVM heap size (~50% of available memory).

Is it possible that outlawing "now" range queries and forcing us to replace them with scripts has anything to do with the performance degradation?

In this particular app we only have about 11K queries, so going from 60ms to 160ms may not be the end of the world, but we are planning to use the 5.x percolator for 100 times more queries, and since percolator performance is linear in the number of queries, that might make it impossible.

We would appreciate any response from ES folks to our problem (perceived or real).

Thank you,

Yuri


(Adrien Grand) #8

Do you mean compared to 1.7 or compared to the query that has aggregations and custom scoring? I am asking because one important performance factor for queries is the number of documents they match, so if you remove filters, this ends up increasing the number of matches, which in turn increases query time. It would be more interesting to compare to 1.7.

Did you have nested docs in your 1.7 index that you removed in your 5.x index? (Just curious, if that's the case, it should not hurt performance, it should help actually :s)

There is something interesting, though, in the way your queries are rewritten. We recently made efforts to make query parsing smarter in the presence of multi-term synonyms. In your case, multi-term synonyms seem to be created implicitly by a shingle filter. For instance, the query dondurma gibisin is parsed as (dondurma gibisin OR dondurma_gibisin) AND gibisin on 1.7 and (dondurma AND gibisin) OR dondurma_gibisin on 5.x. I think it is better now, but an unfortunate consequence of this change is that we are comparing apples to oranges, since the parsed queries are different.
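You can see the shingles being produced with the _analyze API. This is a sketch, not your actual analyzer config: it assumes a standard tokenizer and a shingle filter with an underscore separator, which is what your dondurma_gibisin token suggests (in 5.x, _analyze accepts inline custom filter definitions like this):

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "shingle", "token_separator": "_" }
  ],
  "text": "dondurma gibisin"
}
```

With default shingle settings (unigrams included), that should emit dondurma, dondurma_gibisin, and gibisin as separate tokens -- the positions/overlaps in the output are what the query parser uses to build the boolean structures above.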

It seems to me some of your queries have a filter that looks like this:

{ 
  "bool": {
    "must": [ /* some queries */ ],
    "filter": {
      "bool": {
        "must_not": [
          {
            "term": { "htc" : "TR" }
          },
          {
            "term": { "bcnm" : "TR" }
          }
        ]
      }
    }
  }
}

Could you rewrite them so that they look like this instead? (putting everything on the same level)

{ 
  "bool": {
    "must": [ /* some queries */ ],
    "must_not": [
      {
        "term": { "htc" : "TR" }
      },
      {
        "term": { "bcnm" : "TR" }
      }
    ]
  }
}

The percolator is a different issue, I think. It was rewritten so that it no longer holds all queries in memory: it indexes the terms that occur in those queries in order to efficiently filter out some non-matching queries. Please open a separate thread and share what your typical percolator queries look like. Please also let us know whether this is an upgraded index or if you reindexed in 5.x.


#9

Thank you, I started it at https://discuss.elastic.co/t/5-x-percolator-performance/82437


(Tomasz Elendt) #10

Sorry for the late response.

The latter. I know filters may have implications on search performance.
I've just performed a load test with filters switched off, both in ES 1.7 and 5.2. I see a small drop in the maximum number of queries each cluster can take -- 1.7 can handle ~340 req/s and 5.2 has problems handling 80 req/s.

Unfortunately I don't understand this part. But if this rewrite is responsible for our performance degradation, then I would be more than happy to get the old behavior back.

We removed nested docs (and nested filters) before, that's why you don't see them in queries that I shared (neither in ES 1.7 nor in ES 5.2).

This is a bit problematic, since those term filters (htc & bcnm) are intended for filtering. I know must_not is always executed in filter context, but sometimes these terms appear in must, and in order to force them to be executed in filter context we always place them in filter. Refactoring that would take quite some time. And I don't have high hopes, since ES 5 is at least 4 times slower than ES 1.7 with all filters switched off completely.
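If it helps, I think the flattening can keep filter context without the extra nested bool, by using top-level filter and must_not clauses directly (field names below are just the ones from the earlier example; clauses in filter and must_not are both unscored and cacheable):

```json
{
  "bool": {
    "must": [ /* scoring queries */ ],
    "filter": [
      { "term": { "htc": "TR" } }
    ],
    "must_not": [
      { "term": { "bcnm": "TR" } }
    ]
  }
}
```

That way a term that "sometimes appears in must" can go into the top-level filter array instead of being wrapped in an inner bool.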


(system) #11

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.