Elasticsearch 7.8 worse heap management

Hello Ignacio,

I'm sorry, I was out of the office last week and was unable to look into this issue.

The strange thing is that the cluster doesn't seem to be under pressure; the query time is simply worse than on the old version under all conditions. Maybe some of the memory vs. disk usage tweaks introduced in 7.8.0 are affecting us?

Comparing the version.properties of both releases, we can see the following major differences:

  • 7.6.2:
    elasticsearch = 7.6.3
    lucene = 8.4.0
    bundled_jdk_vendor = adoptopenjdk
    bundled_jdk = 13.0.2+8
  • 7.8.0:
    elasticsearch = 7.8.1
    lucene = 8.5.1
    bundled_jdk_vendor = adoptopenjdk
    bundled_jdk = 14.0.1+7

Right now I'm configuring the whole cluster to use OpenJDK 13 instead of the bundled JDK to verify whether the issue lies there. I will report the new data here.
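
Concretely, this is roughly what I'm running on each node (just a sketch; the OpenJDK package name and JVM path are from our Debian-based hosts and may differ on yours):

# Install a system OpenJDK 13 (package name assumes a distro repository that ships it)
sudo apt-get install openjdk-13-jdk-headless

# Point the .deb service at the system JDK instead of the bundled one
echo 'JAVA_HOME=/usr/lib/jvm/java-13-openjdk-amd64' | sudo tee -a /etc/default/elasticsearch

# Rolling restart, one node at a time
sudo systemctl restart elasticsearch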

Regards,
Carlos

We are more aggressively using mmap to open index files instead of NIO. That is why I was asking for page fault metrics, as they might be an indication that those changes are affecting your setup.

One thing we could look at is the mmap count limit you currently have configured:

https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
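
For example, to check it and, if needed, raise it persistently (262144 is the value from the docs above; adjust to your environment):

# Current per-process limit on memory-mapped areas
sysctl vm.max_map_count

# Raise it for the running system
sudo sysctl -w vm.max_map_count=262144

# Make the change persistent across reboots
echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf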

Hello Ignacio,

We installed Elasticsearch from the .deb packages, so our current mmap limit is the default configured by the package:

# sysctl vm.max_map_count
vm.max_map_count = 262144

I'm looking for the page fault metric; I'm not sure our Prometheus exporters are collecting it. I will send you the data as soon as I find it.
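
If the exporters don't have it, I can probably sample it straight from the kernel; a rough sketch (it assumes a single Elasticsearch java process on the host):

# System-wide page fault counters; node_exporter exposes these as
# node_vmstat_pgfault / node_vmstat_pgmajfault when its vmstat collector is enabled
grep -E '^(pgfault|pgmajfault) ' /proc/vmstat

# Per-process counters for Elasticsearch: fields 10 and 12 of /proc/<pid>/stat are minflt and majflt
ES_PID="$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)"
awk '{print "minflt=" $10, "majflt=" $12}' "/proc/${ES_PID}/stat"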

If that is the case and we are affected by this change, is there a way to configure mmap to behave like it did in 7.6.2?

Thanks,
Carlos

Hi Carlos,

The only way you can influence how Elasticsearch opens files is through the index store settings:

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html
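
For example, forcing niofs would look roughly like this (the index name is a placeholder, and note that niofs avoids mmap entirely, which is not identical to the 7.6.2 hybridfs behaviour):

# Cluster-wide default: add this line to elasticsearch.yml on every node and restart
# index.store.type: niofs

# Or per index, at index creation time
curl -X PUT "localhost:9200/my-new-index" -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.store.type": "niofs"
  }
}'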

If you want to monitor the mmapped file count and/or compare it across versions, note that the stats split these out (a bit confusingly), since regular file handles and mmapped files are tracked separately. In your node stats you can find:

process.open_file_descriptors - regular files
jvm.buffer_pools.mapped.count - memory-mapped files
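
For example, both counters can be pulled per node from the node stats API (assumes jq and a node listening on localhost:9200):

# Regular file handles vs memory-mapped files, per node
curl -s "localhost:9200/_nodes/stats/process,jvm" | jq '
  .nodes[] | {
    name: .name,
    open_file_descriptors: .process.open_file_descriptors,
    mapped_files: .jvm.buffer_pools.mapped.count
  }'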

In lsof these appear together if you just count open files for a process, though they can be separated.
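
For instance (assuming a single Elasticsearch java process; adjust the pgrep pattern otherwise):

# Total open files for the Elasticsearch process, regular and mmapped combined
lsof -p "$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)" | wc -l

# Only the memory-mapped entries (lsof marks them with "mem" in the FD column)
lsof -p "$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)" | awk '$4 == "mem"' | wc -l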

slabtop can also show the memory-mapped area count as vm_area_struct, but system-wide, and the OS itself uses a lot (around 10K in our case). That is the key count to compare against vm.max_map_count, though; it can easily exceed 64K, which is why the default on many systems has to be raised.
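
Rough commands for both views (slabtop needs root; the /proc count is for the Elasticsearch process alone):

# System-wide count of VM areas; -o prints one snapshot and exits
sudo slabtop -o | grep vm_area_struct

# Mapped areas for the Elasticsearch process only
wc -l < "/proc/$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)/maps"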

Not sure any of this helps, but FYI if you want to see how many mapped files are in use. I'd suggest you get the page fault metrics that Ignacio asked for.

I also noticed an increase in query latency after upgrading from 7.5.2 to 7.7.1, without any CPU or memory pressure (on Elastic Cloud).

I'm wondering if moving the terms index from heap to disk is the culprit.

Hi @Ignacio_Vera,

Thanks for your response, also thanks to @Steve_Mushero and @hoppy-kamper. I really appreciate this kind of feedback from the ES community :smiley:

I'm here with new data:
The cluster is now running OpenJDK 13, and with this change the memory usage seems more stable.
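
For reference, the node info API reports which JVM each node is actually running, so something like this confirms the switch (assumes jq and localhost:9200; the using_bundled_jdk flag is reported by recent 7.x versions):

# JVM version, vendor and whether the bundled JDK is in use, per node
curl -s "localhost:9200/_nodes/jvm" | jq '
  .nodes[] | {
    name: .name,
    jvm_version: .jvm.version,
    vm_vendor: .jvm.vm_vendor,
    using_bundled_jdk: .jvm.using_bundled_jdk
  }'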

Unfortunately, it didn't solve the performance loss caused by the upgrade, so we can rule out the major JDK update included in 7.8.0 as the cause.

I've also enabled more monitoring in the cluster to save the page fault metric; here's what I see while our data analytics team is extracting data (total page faults per 60s):

I don't have page fault metrics from the 7.6.2 version.

The data analytics job takes 365 min on 7.8.0 vs 231 min on 7.6.2, so it's a significant performance loss. I saw that 7.8.1 is now available; do you think it could solve the issue? Could it be related to the change mentioned by @hoppy-kamper?

Thanks in advance,
Carlos

Wow, huge change going to JDK 13 vs. the bundled 14. Why that is, or what the options are, is beyond my knowledge; maybe the GC changes.

Hello @Steve_Mushero,

Yes, as @xeraa said in one of the first posts, this JDK change switched the garbage collector from CMS to G1.
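
A quick way to confirm which collector a node ended up with (localhost:9200 assumed):

# gc_collectors lists ParNew/ConcurrentMarkSweep under CMS, or the G1 generations under G1
curl -s "localhost:9200/_nodes/jvm?pretty" | grep -A 4 '"gc_collectors"'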

I was hoping that rolling back the JDK would fix the performance issue, but it didn't :frowning:

Cluster updated to 7.8.1, with the same performance.

It isn't really obvious. Did 7.8.1 fix this issue, or are you seeing the same performance as with 7.8.0?