Elasticsearch 7.8 worse heap management

Hello Ignacio,

I'm sorry, I was out of the office last week and was unable to attend to this issue.

The strange thing is that the cluster doesn't seem to be under pressure. The query time is simply worse than the old version under all conditions. Maybe some tweaks related to memory vs disk usage introduced in 7.8.0 are affecting us?

Comparing your version.properties files, we can see the following major differences:

  • 7.6.2:
    elasticsearch = 7.6.3
    lucene = 8.4.0
    bundled_jdk_vendor = adoptopenjdk
    bundled_jdk = 13.0.2+8
  • 7.8.0:
    elasticsearch = 7.8.1
    lucene = 8.5.1
    bundled_jdk_vendor = adoptopenjdk
    bundled_jdk = 14.0.1+7

Just now I'm configuring the full cluster to use OpenJDK 13 instead of the bundled JDK to verify whether the issue is there. I will report the new data here.
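(For anyone wanting to try the same thing: one way to point a .deb install at a different JDK is to set JAVA_HOME in the package's environment file and then restart the nodes; the path below is only an example for an Ubuntu OpenJDK 13 package.)

# /etc/default/elasticsearch -- example only; adjust the path to your OpenJDK 13 install
JAVA_HOME=/usr/lib/jvm/java-13-openjdk-amd64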

Regards,
Carlos

We are more aggressively using mmap to open index files instead of non-blocking I/O. That is the reason I was asking for page fault metrics, as they might be an indication that those changes are affecting your setup.

One thing we could look at is the limits on mmap counts you currently have:

https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
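If that limit turns out to be too low for your index layout, it can be checked and raised persistently with sysctl (a minimal sketch for a typical Linux host; 262144 is the value the documentation recommends):

# check the current limit
sysctl vm.max_map_count
# raise it now and persist it across reboots
sysctl -w vm.max_map_count=262144
echo 'vm.max_map_count=262144' >> /etc/sysctl.conf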

Hello Ignacio,

We installed Elasticsearch with the .deb packages, so our current mmap limit is the default configured by the package:

# sysctl vm.max_map_count
vm.max_map_count = 262144

I'm looking for the page faults metric; I'm not sure our Prometheus exporters are collecting it. I'll send you the data once I find it.
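(In the meantime, one way to sample page faults for the Elasticsearch process directly from the OS - a rough sketch assuming a standard Linux host with procps ps; the pgrep pattern matches the usual Elasticsearch main class:)

# cumulative minor/major page faults for the Elasticsearch JVM since process start
ps -o pid,min_flt,maj_flt -p "$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)"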

If this is the case and we are being affected by this change, is there a way to configure mmap as in the 7.6.2 version?

Thanks,
Carlos

Hi Carlos,

The only way you can change how Elasticsearch opens files is by using the index store settings:

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html
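For example, if you wanted to force the pre-7.8 NIO behaviour, something like the following should work (the index name and host are placeholders; index.store.type is a static setting, so it has to be applied at index creation time, or set as a default in elasticsearch.yml):

curl -X PUT 'http://localhost:9200/my-index' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.store.type": "niofs"
  }
}'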

If you want to monitor the mmapped file count, and/or compare it across versions, the stats are split out (a bit confusing), as regular file handles and mmap'd files are reported separately; in your node stats you can find the following (example request below):

process.open_file_descriptors - Regular files
jvm.buffer_pools.mapped.count - Memory mapped files
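A quick way to pull just those two counters (the host is a placeholder):

curl -s 'http://localhost:9200/_nodes/stats/process,jvm?filter_path=nodes.*.name,nodes.*.process.open_file_descriptors,nodes.*.jvm.buffer_pools.mapped&pretty'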

In lsof these are seen together if you just do a count of open files for a process, though they can be separated.
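Roughly, assuming a Linux host and standard lsof output, where memory-mapped files show up with mem in the FD column:

# <PID> is the Elasticsearch process id
lsof -p <PID> | awk '$4 == "mem"' | wc -l        # memory-mapped files
lsof -p <PID> | awk '$4 ~ /^[0-9]+/' | wc -l     # regular file descriptors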

slabtop can also show the memory-mapped count, as vm_area_struct, but system-wide, and the OS itself uses a lot of these (around 10K in our case). That is the key count to compare against the system-wide vm.max_map_count setting though, and it can easily exceed 64K, which is why the default on many systems has to be raised.
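For example, on a typical Linux host (both need root):

sudo slabtop -o | grep vm_area_struct
# or
sudo grep vm_area_struct /proc/slabinfo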

Not sure any of this helps, but it's worth knowing if you want to see how many mapped files are in use. I'd also suggest you get the page fault metrics that Ignacio asked for.


I also noticed an increase in query latency after upgrading from 7.5.2 to 7.7.1, without any CPU or memory pressure (on Elastic Cloud).

I'm wondering if moving the terms index from heap to disk is the culprit.
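One way to sanity-check that (just a suggestion; the host is a placeholder) is to compare the segment memory reported by the stats API before and after the upgrade - where the terms index has moved off heap, terms_memory_in_bytes drops sharply:

curl -s 'http://localhost:9200/_stats/segments?filter_path=_all.total.segments&pretty'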


Hi @Ignacio_Vera,

Thanks for your response, also thanks to @Steve_Mushero and @hoppy-kamper. I really appreciate this kind of feedback from the ES community :smiley:

I'm here with new data:
The cluster is now running OpenJDK 13, and with this change the memory usage seems more stable.

Unfortunately, it didn't solve the loss of performance caused by the upgrade, so we can rule out the major JDK update included in 7.8.0 as the reason for it.

I've also enabled more monitoring in the cluster to record the page faults metric; here's what I see (total page faults per 60 s) while our data analytics team is extracting data.

I don't have page fault metrics from the 7.6.2 version.

The data analytics job takes 365 min on 7.8.0 vs 231 min on 7.6.2, so it's a significant loss of performance. I saw that 7.8.1 is now available; do you think it could solve the issue? Could it be related to the change mentioned by @hoppy-kamper?

Thanks in advance,
Carlos

Wow, that's a huge change going to JDK 13 vs. the bundled 14 - beyond my knowledge as to why, or what the options are - maybe the GC changes.

Hello @Steve_Mushero ,

Yes, as @xeraa said in one of the first posts, this JDK change switched the garbage collector from CMS to G1.
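(If anyone wants to confirm which collector their nodes ended up with, jcmd from the JDK can report it; <PID> is the Elasticsearch process id, and the command has to run as the same user as the Elasticsearch process:)

jcmd <PID> VM.flags | tr ' ' '\n' | grep -E 'UseG1GC|UseConcMarkSweepGC'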

I was hoping that rolling back the JDK would fix the performance issue, but it didn't :frowning:

Cluster updated to 7.8.1, with the same performance.

It isn't really obvious... Did 7.8.1 fix this issue, or are you seeing the same performance as 7.8.0?

Hi @Jathin,

I see the same performance as 7.8.0 :frowning:

Out of curiosity, how big is your cluster in terms of document count and storage size?

Hi cpmoore, you can see it in the first screenshot of this thread.


We've also been having big performance issues on 7.8.1. The cluster was set up from scratch on Kubernetes with the official Helm chart. We imported the same data we have in ES 6.8, and even with double the resources we are not able to handle the same amount of load as before.

The performance degradation mainly manifests as heavily slowed indexing times (REST bulk requests take up to 30 seconds vs. the usual few ms). However, the HTTP response times we see from our web app are not fully reflected in the increased indexing times shown in Stack Monitoring.

Our 6.8 cluster has 3 master and 6 data nodes (16 GB RAM, 10 GB heap, 4 CPUs; 300M documents, 36 indices, 150 shards, heavy indexing).

EDIT: We are also seeing increased CPU usage (2x-4x) with 7.8.1, as described in High OS Cpu usage on 7.7.1.


We ended up upgrading just to 7.6.2, which is working great.

Has anybody been successful in fixing these issues on higher versions of ES?

And is this issue something that's going to be addressed in upcoming releases?

/cc @Steve_Mushero @Ignacio_Vera

I'm afraid that, for our use case, the lower performance is a case of "it's not a bug, it's a feature" - that's my impression :frowning:

@Carlos_Moya I'd think that might be the case if performance were a few percent lower, but if the cluster is literally unstable with the same data and double the resources, then I feel there must be some bigger underlying issue. But I guess as long as only a few people are affected, there's no point in investigating this further on your side. :frowning:

We upgraded from 7.6.2 to 7.8.1 a month ago. On day 1, we experienced long GC times, 15-20+ seconds, which we had not seen for a long, long time. This still occurs now and then, maybe a few times a week. As far as we can tell, 7.8.x is doing substantially more IO than previous versions. We don't, however, see any decrease in ingestion throughput.

On a side note, I have learned that 7.9.x contains a fix related to G1GC, so we are currently awaiting the next 7.9 release, which we hope will help with our GC issues.

Enhance real memory circuit breaker with G1 GC by henningandersen · Pull Request #58674 · elastic/elasticsearch · GitHub
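(If you want to experiment before 7.9 lands: custom JVM flags can be dropped into the jvm.options.d directory that recent 7.x packages read; the G1 values below are illustrative only - check the PR for the actual change.)

# /etc/elasticsearch/jvm.options.d/g1.options -- illustrative values only; relevant when running with G1
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30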
