Performance problems when Upgrading from ElasticSearch 1.7.4 to 5.4.0

(L) #1

We are upgrading our ElasticSearch 1.7.4 cluster to 5.4. We made mapping/code modifications so that everything is functionally working, but the performance is much worse than 1.7.4 (same exact cluster size, data size, and node types). With no load on the cluster, a query takes 10X the time it takes on 1.7.4. The worst performance seems to be related to queries that pull back a lot of records. In trying to improve performance - here are some things we’ve looked at:

  • We were seeing a fuzzy match query on a few fields that returned 5000 records go from 2-3 seconds on 1.7.4 to 30+ seconds on 5.4, so we thought possibly the new ABM25 similarity algorithm could be slowing things down. We changed the similarity algorithm to classic - no impact.

  • We double checked memory usage - we have 16GB allocated to each node across 16 nodes. This is the same as before. We compared memory usage during activity between 1.7.4 and 5.4 and its very similar

  • We tried queries using numeric since previously most of our fields were mapped as string and with 5.4 we auto-mapped them so they were text + keyword. Numeric based queries were also slower, but not quite as extreme.

  • We have also double checked that index files are memory mapped. We merged segments down to 5 per shard. We made sure mlockall was enabled. We confirmed no swapping is occurring at the OS level.

(For context, we have 1/2 billion to a billon records. The records are pretty complex - 100s of attributes in each record)

Has anybody noticed performance differences between older versions of ES and ES 5.x (with 5.x being slower)? Any recommendations on what to look at? Just ask if there are clarifications that would help with an answer.

Any thoughts would be greatly appreciated.


(Denis Gladkikh) #2

Is it possible that you are just experience cold caches? Have you tried to warm up the indices by running sample queries for a while and only after that compare the results?

(L) #3

Denis -

Thanks for the question. We have warmed up the cluster - running a slew of queries before running our test. We've also run tests where we have run the exact same queries over and over to watch the change in performance. Running the exact same query over and over several times does result in it going a bit faster... but instead of being 10X slower, it might be 8 or 9X slower than 1.7.4. Also - running a variety queries to warm up the cluster has minimal impact on our test queries unless the warm up queries were exactly the same.

We've also ran tests where we explicitly clear caches on the 1.7.4 cluster, run test queries there. Even with a cleared cache on 1.7.4, its much faster than a warmed up 5.4 cluster.

(Denis Gladkikh) #4

That is strange, but I assume that there are should be very simple explanation, considering so much difference (8-10x times).

Have you tried to run the queries with profiling enabled, I believe it is a new API in ES 2, so you will not be able to compare results with elastic search 1.7, but maybe it can get better picture what can slow down the queries in 5.x.

(L) #5

Yeah - we are really hoping for a simple explanation, but we have yet to find it :slight_smile:

We have tried profiling the queries on our new 5.4 test environment. Here is an example read out for one test query from one shard (we picked the slowest shard to show here...

"id": "[hbyLXwi5TkWyVETGjxq_pA][prospectprofiles][5]",
"searches": [
"query": [
"type": "BooleanQuery",
"description": "+wealth.indirect_assets.value:[7 TO 7] +wealth.business_ownership.value:[4 TO 4]",
"time": "1570.097449ms",
"time_in_nanos": 1570097449,
"breakdown": {
"score": 1955365,
"build_scorer_count": 27,
"match_count": 0,
"create_weight": 56642,
"next_doc": 2122316,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2223,
"score_count": 2195,
"build_scorer": 1565958680,
"advance": 0,
"advance_count": 0
"children": [
"type": "",
"description": "wealth.indirect_assets.value:[7 TO 7]",
"time": "739.3835410ms",
"time_in_nanos": 739383541,
"breakdown": {
"score": 359206,
"build_scorer_count": 27,
"match_count": 0,
"create_weight": 1657,
"next_doc": 29361,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 122,
"score_count": 2195,
"build_scorer": 738595905,
"advance": 392959,
"advance_count": 2108
"type": "",
"description": "wealth.business_ownership.value:[4 TO 4]",
"time": "827.9645740ms",
"time_in_nanos": 827964574,
"breakdown": {
"score": 384367,
"build_scorer_count": 27,
"match_count": 0,
"create_weight": 328,
"next_doc": 387643,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2101,
"score_count": 2195,
"build_scorer": 827144900,
"advance": 42851,
"advance_count": 161
"rewrite_time": 487905578,
"collector": [
"name": "CancellableCollector",
"reason": "search_cancelled",
"time": "4.124567000ms",
"time_in_nanos": 4124567,
"children": [
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "3.310050000ms",
"time_in_nanos": 3310050
"aggregations": []

The "took" time for this query was 15.6 seconds, but when we add up all the times recorded for each shard they add up to 5.3 seconds - and that's if each shard worked in serial which isn't the case. We weren't sure what to make of this. Most of the time in the "took" time can't be accounted for by looking at the times in the profile stats one each shard. Let me know if you have any ideas.


(L) #6

To follow up on the last comment. It seems that the profiling stats show that the slowness is not processing the query, but is somehow related to moving the data around after its found. I'm not sure if the changes to inter-node communication between ES 1.7.4 and 5.4 would have any impact. We are using AWS i3.xl nodes - so the network between them is pretty good.

(Denis Gladkikh) #7

I would probably start localizing the issue. You can use preference flags to try to search just on specific node. For example, if you are assuming that this is a transport - try to run this query on specific node with specific shard and call the query with preference=_shards:2,3|_primary as an example to exclude any routing. See if that will give you better result.
If it does - possible a bug in communication. Next options are look in sources, increase verbosity, submit a bug.

(L) #8

Thanks for the feedback. We took a deeper dive and found some more interesting insight.

First - we tried your suggestion just running the query on one shard. We found that did not run quickly, so that seemed to indicate our guess that the slowness was at the transport level was maybe not right.

We then shifted the cluster logging to TRACE and watched the timing as the query went through the 4 stages (query, fetch, expand, response) on each node. We found the query stage was fast - around 10 ms. The major slowness was in the fetch stage, but not on all nodes. We found fetch runs quickly (10-100ms) on many nodes and then a random one or two nodes will take 10-15 seconds to run the fetch stage. If we run the same query again - a different couple nodes will have a slow fetch stage. Once in a while all nodes run quickly and the query returns quickly.

The expand and response stages are pretty quick all the time.

So we are now drilling in to understand what could be causing the fetch stage to run slowly. Let me know if you have any more ideas.


(Denis Gladkikh) #9

Fetch is loading documents from the disk. 10-15 seconds is definitely a lot. How many documents are you returning with the Query?

Maybe look into strace to see if this is a system issue and maybe slow IO?

(L) #10

OK - we have dug deeper into the issue and found some strange things. We continued our investigation into disk IO. This does seem to be the core issue. We tracked # of reads / second and bytes read / second. The # of reads/second between 1.7.4 and 5.4.0 are not massively different, but the # of bytes read was 50X more with 5.4.0. some things to note about our tests:

  • The data on 1.7.4 and 5.4.0 is pretty well the same. We have around 500M records. Average record size in JSON form is around 13K. Average record on disk (total index / # of records) looks to be around 3-4K.

  • Our test query pulls back the first 1000 records

  • On both 1.7.4 and 5.4.0 we are using = mmapfs

  • Our nodes are running on AWS i3.xlarge with a modified version of Amazon’s Linux AMI version 2017.03

We decided to build a set of smaller test indexes with different We loaded an index with type = fs, type = mmapfs, type = niofs. We found both fs and mmapfs both had really high IO. This would make sense since it seems the documentation says that type=fs will pick mmapfs if your OS is 64 bit… Interestingly, type = niofs had much smaller IO.

We are in the process of loading our full index with type = niofs to see how that impacts performance overall. It would be good if niofs runs faster, but it makes no sense to us why mmapfs would have higher IO. We would expect the opposite.

We looked at some settings related to memory:

  • We have vm.max_map_count=262144
  • We give the ES JVM 16GB out of 30.5GB

NOTE - these settings are the same for 1.7.4 nodes.

Any ideas of why version 5.4.0 doesn’t work well with mmapfs file type or requires special settings that we haven’t figure out yet would be greatly appreciated.


(L) #11

More follow up information:

We now have a fully loaded 5.4 cluster running with = niofs. If we compare the speed of queries and related IO amounts to the fully loaded 1.7.4 cluster with = fs (default), we find that 5.4 speed is almost as good as 1.7.4 and the IO is significantly lower than what it use to be when we had = fs (which does mmapfs).

While this is good, there are quite a few articles talking about how mmapfs should be used if you are running on modern hardware:

So we continue to try to figure out what configuration is causing the massive IO when we use mmapfs. It acts like it must be continually swapping pages in and out of memory from disk in some kind of memory/disk thrashing situation.

The configuration on our nodes is as follows:

  • 30.5GB of RAM
  • We are using instance store SSDs on i3.xlarge instances in AWS
  • ES JVM is allocated 16GB for heap space, leaving 14GB of space for the OS
  • While things are running, we can see page cache size in the top command is 13GB
  • We see around 300GB of virtual memory is associated with the JVM process
  • /usr/lib/systemd/system/elasticsearch.service -> LimitMEMLOCK=infinity
  • /etc/sysconfig/elasticsearch -> MAX_LOCKED_MEMORY=unlimited
  • /etc/sysconfig/elasticsearch -> MAX_MAP_COUNT=262144
  • /etc/sysconfig/elasticsearch -> MAX_OPEN_FILES=65536
  • /etc/elasticsearch/elasticsearch.yml -> bootstrap.memory_lock: true
  • ulimit -l, ulimit -v and ulimit -m are all unlimited

We’ve tried turning the memory lock settings off and on - it made no different in IO and performance. In all cases, IO is super high and performance is slow.

It seems like a lot of people have had problems with using mmapfs vs. niofs back in version 1.x and 2.x. Default settings worked fine back in 1.7 for us, but now mmapfs doesn’t work well at all in 5.4.

Any ideas on this? Is this a known issue with ElasticSearch/Lucene? Any ideas on settings we should look at? Should we give up and just use niofs? I’d like to see better performance - 5.4 is still slightly slower than 1.7.4


(system) #12

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.