Elasticsearch 7.8 worse heap management

Hello Ignacio,

I'm sorry, I was out of the office last week and was unable to attend to this issue.

The strange thing is that the cluster doesn't seem to be under pressure. The query time is simply worse than the old version under all conditions. Maybe some tweaks related to memory vs disk usage introduced in 7.8.0 are affecting us?

Comparing your version.properties files, we can see the following major differences:

  • 7.6.2:
    elasticsearch = 7.6.3
    lucene = 8.4.0
    bundled_jdk_vendor = adoptopenjdk
    bundled_jdk = 13.0.2+8
  • 7.8.0:
    elasticsearch = 7.8.1
    lucene = 8.5.1
    bundled_jdk_vendor = adoptopenjdk
    bundled_jdk = 14.0.1+7

Just now I'm configuring the full cluster to use OpenJDK 13 instead of the bundled JDK to verify whether the issue is there. I will report the new data here.
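(For anyone wanting to try the same thing: one way to point a .deb install at a different JDK is to set JAVA_HOME in the package's environment file and then restart the nodes; the path below is only an example for an Ubuntu OpenJDK 13 package.)

# /etc/default/elasticsearch -- example only; adjust the path to your OpenJDK 13 install
JAVA_HOME=/usr/lib/jvm/java-13-openjdk-amd64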

Regards,
Carlos

We are more aggressively using mmap to open index files instead of non-blocking I/O. That is the reason I was asking for page fault metrics, as they might be an indication that those changes are affecting your setup.

One thing we could look at is the limits on mmap counts you currently have:

https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
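If that limit turns out to be too low for your index layout, it can be checked and raised persistently with sysctl (a minimal sketch for a typical Linux host; 262144 is the value the documentation recommends):

# check the current limit
sysctl vm.max_map_count
# raise it now and persist it across reboots
sysctl -w vm.max_map_count=262144
echo 'vm.max_map_count=262144' >> /etc/sysctl.conf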

Hello Ignacio,

We installed Elasticsearch with the .deb packages, so our current mmap limit is the default configured by the package:

# sysctl vm.max_map_count
vm.max_map_count = 262144

I'm looking for the page faults metric; I'm not sure our Prometheus exporters are collecting it. I'll send you the data once I find it.
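(In the meantime, one way to sample page faults for the Elasticsearch process directly from the OS - a rough sketch assuming a standard Linux host with procps ps; the pgrep pattern matches the usual Elasticsearch main class:)

# cumulative minor/major page faults for the Elasticsearch JVM since process start
ps -o pid,min_flt,maj_flt -p "$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)"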

If this is the case and we are being affected by this change, is there a way to configure mmap as in the 7.6.2 version?

Thanks,
Carlos

Hi Carlos,

The only way you can change how Elasticsearch opens files is by using the index store settings:

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html
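For example, if you wanted to force the pre-7.8 NIO behaviour, something like the following should work (the index name and host are placeholders; index.store.type is a static setting, so it has to be applied at index creation time, or set as a default in elasticsearch.yml):

curl -X PUT 'http://localhost:9200/my-index' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.store.type": "niofs"
  }
}'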

If you want to monitor the mmapped file count, and/or compare it across versions, the stats are split out (a bit confusing), as regular file handles and mmap'd files are reported separately; in your node stats you can find the following (example request below):

process.open_file_descriptors - Regular files
jvm.buffer_pools.mapped.count - Memory mapped files
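A quick way to pull just those two counters (the host is a placeholder):

curl -s 'http://localhost:9200/_nodes/stats/process,jvm?filter_path=nodes.*.name,nodes.*.process.open_file_descriptors,nodes.*.jvm.buffer_pools.mapped&pretty'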

In lsof these are seen together if you just do a count of open files for a process, though they can be separated.
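Roughly, assuming a Linux host and standard lsof output, where memory-mapped files show up with mem in the FD column:

# <PID> is the Elasticsearch process id
lsof -p <PID> | awk '$4 == "mem"' | wc -l        # memory-mapped files
lsof -p <PID> | awk '$4 ~ /^[0-9]+/' | wc -l     # regular file descriptors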

slabtop can also show the memory-mapped count, as vm_area_struct, but system-wide, and the OS itself uses a lot of these (around 10K in our case). That is the key count to compare against the system-wide vm.max_map_count setting though, and it can easily exceed 64K, which is why the default on many systems has to be raised.
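For example, on a typical Linux host (both need root):

sudo slabtop -o | grep vm_area_struct
# or
sudo grep vm_area_struct /proc/slabinfo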

Not sure any of this helps, but it's worth knowing if you want to see how many mapped files are in use. I'd also suggest you get the page fault metrics that Ignacio asked for.


I also noticed an increase in query latency after upgrading from 7.5.2 to 7.7.1, without any CPU or memory pressure (on Elastic Cloud).

I'm wondering if moving the terms index from heap to disk is the culprit.
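One way to sanity-check that (just a suggestion; the host is a placeholder) is to compare the segment memory reported by the stats API before and after the upgrade - where the terms index has moved off heap, terms_memory_in_bytes drops sharply:

curl -s 'http://localhost:9200/_stats/segments?filter_path=_all.total.segments&pretty'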


Hi @Ignacio_Vera,

Thanks for your response, also thanks to @Steve_Mushero and @hoppy-kamper. I really appreciate this kind of feedback from the ES community :smiley:

I'm here with new data:
The cluster is now running OpenJDK 13, and with this change the memory usage seems more stable.

Unfortunately, it didn't solve the loss of performance caused by the upgrade, so we can rule out the major JDK update included in 7.8.0 as the reason for it.

I've also enabled more monitoring in the cluster to record the page faults metric; here's what I see (total page faults per 60 s) while our data analytics team is extracting data.

I don't have page fault metrics from the 7.6.2 version.

The data analytics job takes 365 min on 7.8.0 vs 231 min on 7.6.2, so it's a significant loss of performance. I saw that 7.8.1 is now available; do you think it could solve the issue? Could it be related to the change mentioned by @hoppy-kamper?

Thanks in advance,
Carlos

Wow, that's a huge change going to JDK 13 vs. the bundled 14 - beyond my knowledge as to why, or what the options are - maybe the GC changes.

Hello @Steve_Mushero ,

Yes, as @xeraa said in one of the first posts, this JDK change switched the garbage collector from CMS to G1.
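(If anyone wants to confirm which collector their nodes ended up with, jcmd from the JDK can report it; <PID> is the Elasticsearch process id, and the command has to run as the same user as the Elasticsearch process:)

jcmd <PID> VM.flags | tr ' ' '\n' | grep -E 'UseG1GC|UseConcMarkSweepGC'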

I was hoping that rolling back the JDK would fix the performance issue, but it didn't :frowning:

Cluster updated to 7.8.1, with the same performance.

It isn't really obvious... Did 7.8.1 fix this issue, or are you seeing the same performance as 7.8.0?

Hi @Jathin,

I see the same performance as 7.8.0 :frowning:

Out of curiosity, how big is your cluster in terms of document count and storage size?

Hi cpmoore, you can see it in the first screenshot of this thread.


We've also been having big performance issues on 7.8.1. The cluster was set up from scratch on Kubernetes with the official Helm chart. We imported the same data we have in ES 6.8, and even with double the resources we are not able to handle the same amount of load as before.

The performance degradation mainly manifests as heavily slowed indexing times (REST bulk requests take up to 30 seconds vs. the usual few ms). However, the HTTP response times we see from our web app are not fully reflected in the increased indexing times shown in Stack Monitoring.

Our 6.8 cluster has 3 master and 6 data nodes (16 GB RAM, 10 GB heap, 4 CPUs; 300M documents, 36 indices, 150 shards, heavy indexing).

EDIT: We are also seeing increased CPU usage (2x-4x) with 7.8.1, as described in High OS Cpu usage on 7.7.1.


We ended up upgrading just to 7.6.2, which is working great.

Has anybody been successful in fixing these issues on higher versions of ES?

And is this issue something that's going to be addressed in upcoming releases?

/cc @Steve_Mushero @Ignacio_Vera

I'm afraid that, for our use case, the lower performance is a case of "it's not a bug, it's a feature" - that's my impression :frowning:

@Carlos_Moya I'd think that might be the case if performance were a few percent lower, but if the cluster is literally unstable with the same data and double the resources, then I feel there must be some bigger underlying issue. But I guess as long as only a few people are affected, there's no point in investigating this further on your side. :frowning:

We upgraded from 7.6.2 to 7.8.1 a month ago. On day 1, we experienced long GC times, 15-20+ seconds, which we had not seen for a long, long time. This still occurs now and then, maybe a few times a week. As far as we can tell, 7.8.x is doing substantially more IO than previous versions. We don't, however, see any decrease in ingestion throughput.

On a side note, I have learned that 7.9.x contains a fix related to G1GC, so we are currently awaiting the next 7.9 release, which we hope will help with our GC issues.

Enhance real memory circuit breaker with G1 GC by henningandersen · Pull Request #58674 · elastic/elasticsearch · GitHub
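(If you want to experiment before 7.9 lands: custom JVM flags can be dropped into the jvm.options.d directory that recent 7.x packages read; the G1 values below are illustrative only - check the PR for the actual change.)

# /etc/elasticsearch/jvm.options.d/g1.options -- illustrative values only; relevant when running with G1
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30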
