Performance problems when upgrading from ElasticSearch 1.7.4 to 5.4.0

More follow-up information:

We now have a fully loaded 5.4 cluster running with index.store.type = niofs. Comparing query speed and the associated IO against the fully loaded 1.7.4 cluster running with index.store.type = fs (the default), we find that 5.4 is almost as fast as 1.7.4, and its IO is significantly lower than it was when we ran 5.4 with index.store.type = fs (which uses mmapfs).

While this is good, there are quite a few articles talking about how mmapfs should be used if you are running on modern hardware: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

So we are still trying to figure out which configuration causes the massive IO when we use mmapfs. It behaves as if pages are continually being evicted from memory and read back in from disk, in some kind of memory/disk thrashing situation.
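One way to check the thrashing theory would be to watch the major page fault counter of the ES process, since a major fault means a mapped page had to be read from disk. Below is a minimal sketch, assuming Linux and the documented /proc/<pid>/stat layout (minflt is field 10, majflt is field 12). The function names and the 5 second sampling interval are just illustrative choices, not something from our setup.

```python
import time


def parse_fault_counts(stat_line):
    """Return (minor_faults, major_faults) from one /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may contain
    # spaces or parentheses itself, so split on the LAST ')' in the line.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so field 10 (minflt) is index 7
    # and field 12 (majflt) is index 9.
    return int(after_comm[7]), int(after_comm[9])


def read_fault_counts(pid):
    """Read the current fault counters for a running process."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_fault_counts(f.read())


def major_faults_per_second(pid, interval_s=5.0):
    """Sample the major fault counter twice and return the rate in between."""
    _, before = read_fault_counts(pid)
    time.sleep(interval_s)
    _, after = read_fault_counts(pid)
    return (after - before) / interval_s
```

Calling major_faults_per_second with the ES process id under query load, once with mmapfs and once with niofs, would show whether the extra IO lines up with a high major fault rate.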

The configuration on our nodes is as follows:

  • 30.5GB of RAM
  • We are using instance store SSDs on i3.xlarge instances in AWS
  • ES JVM is allocated 16GB for heap space, leaving 14GB of space for the OS
  • While things are running, we can see page cache size in the top command is 13GB
  • We see around 300GB of virtual memory is associated with the JVM process
  • /usr/lib/systemd/system/elasticsearch.service -> LimitMEMLOCK=infinity
  • /etc/sysconfig/elasticsearch -> MAX_LOCKED_MEMORY=unlimited
  • /etc/sysconfig/elasticsearch -> MAX_MAP_COUNT=262144
  • /etc/sysconfig/elasticsearch -> MAX_OPEN_FILES=65536
  • /etc/elasticsearch/elasticsearch.yml -> bootstrap.memory_lock: true
  • ulimit -l, ulimit -v and ulimit -m are all unlimited
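To put those numbers in perspective, here is a rough back-of-the-envelope sketch of how much of the mapped data could stay in the page cache at once. It assumes, as a simplification, that everything in the JVM's virtual size beyond the heap is memory-mapped index files, which is not exactly true (thread stacks, code cache and so on are in there too).

```python
# Figures taken from the node configuration listed above.
TOTAL_RAM_GB = 30.5
JVM_HEAP_GB = 16.0
JVM_VIRTUAL_GB = 300.0  # virtual memory reported for the JVM process


def page_cache_budget_gb(total_ram_gb, heap_gb):
    """RAM left over for the OS page cache once the heap is reserved."""
    return total_ram_gb - heap_gb


def cacheable_fraction(total_ram_gb, heap_gb, virtual_gb):
    """Fraction of the mapped data that fits in the page cache at once."""
    # Assumption: all virtual memory beyond the heap is mmapped index data.
    mapped_gb = virtual_gb - heap_gb
    return page_cache_budget_gb(total_ram_gb, heap_gb) / mapped_gb


if __name__ == "__main__":
    budget = page_cache_budget_gb(TOTAL_RAM_GB, JVM_HEAP_GB)
    fraction = cacheable_fraction(TOTAL_RAM_GB, JVM_HEAP_GB, JVM_VIRTUAL_GB)
    print(f"page cache budget: {budget:.1f} GB")
    print(f"cacheable share of mapped data: {fraction:.1%}")
```

Under that assumption only around 5% of the mapped data fits in the page cache at any moment, so the working set of hot pages would have to be quite small for mmapfs to avoid going back to disk.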

We’ve tried turning the memory lock settings off and on - it made no difference in IO or performance. In all cases, IO is very high and performance is slow.

It seems like a lot of people have had problems with using mmapfs vs. niofs back in version 1.x and 2.x. Default settings worked fine back in 1.7 for us, but now mmapfs doesn’t work well at all in 5.4.

Any ideas on this? Is this a known issue with ElasticSearch/Lucene? Are there settings we should look at? Should we give up and just use niofs? I’d like to see better performance - 5.4 is still slightly slower than 1.7.4.

Thanks.
