Performance degradation after upgrading from v6.3.2 to v7.4.2

Hi, we have a cluster with hot and warm nodes. Our warm nodes have around 400 shards.
In 7.4.2 we see a 10% increase in CPU usage compared to 6.3.2.
According to the hot threads call stacks, it is related to management tasks (see below).

 99.6% (497.7ms out of 500ms) cpu usage by thread 'elasticsearch[NODE-NAME][management][T#5]'
 2/10 snapshots sharing following 27 elements
   java.base@13.0.1/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:149)
   java.base@13.0.1/sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
   java.base@13.0.1/java.nio.file.Files.readAttributes(Files.java:1842)
   java.base@13.0.1/java.nio.file.Files.getLastModifiedTime(Files.java:2393)

We were able to trace the management tasks back to the _nodes/stats query that our exporter is executing.
Do you know of any change that could affect the performance of _nodes/stats on nodes with many shards?

Thanks!

No, I don't know of anything that might change this, but there have been many changes between 6.3.2 and 7.4.2. Stats calls are not usually performance-critical, so their performance isn't tracked, and it's definitely possible that their behaviour has changed. Some stats are certainly a little expensive or IO-intensive to compute. Do you need to request the stats so often that they would cause a performance impact? Do you need all the stats, or would a (cheaper) subset suffice?
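
For what it's worth, the stats API accepts a list of metric groups, so you can request only the bits you actually graph. A minimal sketch with curl (the host and the particular metric groups below are just illustrative, not a recommendation for your exporter):

  # Full node stats for every node (roughly what a typical exporter requests):
  curl -s 'http://localhost:9200/_nodes/stats?pretty'

  # A cheaper subset: only the jvm, os and thread_pool metric groups:
  curl -s 'http://localhost:9200/_nodes/stats/jvm,os,thread_pool?pretty'

  # filter_path trims the response body (it mainly reduces the response size,
  # not the work done on the node):
  curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem&pretty'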

The exporter interval is one minute, so I wouldn't expect it to create a performance issue. However, when I remove the exporter and there are no stats calls on the cluster, the CPU drops. The CPU spike happens only on the warm nodes; the hot nodes are not affected, so it seems to be correlated with the number of shards on the node.

It could take quite some resources to compute a full complement of stats for each node every minute, particularly if the warm nodes have slower disks. The fragment of a stack trace in your OP shows it is spending quite some time computing the last-modified time of some file or other, which could well be IO-bound.

@DavidTurner thanks for the quick response. We are seeing more performance changes in the cluster which seem to be related to mapping/bulk indexing changes.
We are using daily indices with dynamic mappings, so at the start of each day we have a lot of put-mapping requests. In 6.3.2 this was never a big issue: for a few minutes the pending tasks would spike to up to 600 until most of the put-mapping requests were done. In 7.4.2 we are seeing a much larger spike of 40,000 pending tasks, which causes the master to hang. The same data is being ingested into the cluster and the number of nodes is the same.
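
For anyone wanting to reproduce this observation, the queue can be watched during the daily rollover with the pending tasks APIs; a minimal sketch, assuming curl access to the cluster:

  # One line per queued cluster-state task, with priority and time in queue:
  curl -s 'http://localhost:9200/_cat/pending_tasks?v'

  # The same information as JSON:
  curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'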

The only thing I can think of that is different is that we copied indices from version 6 to version 7 using a snapshot, and the old indices had a single mapping type.

It feels like the old version had some kind of limit that prevented the nodes from sending all mapping updates in parallel and restricted them to one request per thread (thereby throttling the pending tasks and protecting the master).

Can you think of anything that changed in the bulk or put-mapping area that could help us explain the new behavior we are seeing?

Thanks!

Yes, prior to 7.2.0 each mapping update would block one of the (limited number of) write threads, preventing any other indexing activity until the master had responded, but since 7.2.0 these updates happen asynchronously.

It's normally wise to move most of your mappings into your index templates when working at scale. Dynamic mapping updates are an enemy of good performance since they cause indexing to bottleneck on the master.
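
For example, a template along these lines declares the known fields up front and keeps unexpected fields from generating put-mapping requests; the index pattern and field names below are made up for illustration:

  # Illustrative legacy template for daily indices (names are hypothetical).
  # "dynamic": false stores unmapped fields in _source without adding them to
  # the mapping, so they never trigger a put-mapping request on the master;
  # "strict" would reject such documents instead.
  curl -s -X PUT 'http://localhost:9200/_template/daily-logs' \
    -H 'Content-Type: application/json' -d '
  {
    "index_patterns": ["logs-*"],
    "mappings": {
      "dynamic": false,
      "properties": {
        "@timestamp": { "type": "date" },
        "message":    { "type": "text" },
        "level":      { "type": "keyword" }
      }
    }
  }'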

That said, I'm a little surprised this makes the master hang since it processes all the pending mapping updates at once. Can you be more precise about the nature of this hang? Can you capture the hot threads on the master while it's "hanging", ideally repeatedly, and share them here?
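
Something like the following would do for the repeated capture (the host, sample count and sleep interval are only suggestions):

  # Sample hot threads on the elected master every 10 seconds, 30 times,
  # appending each sample with a timestamp to a file:
  for i in $(seq 1 30); do
    date >> master-hot-threads.txt
    curl -s 'http://localhost:9200/_nodes/_master/hot_threads?threads=3&interval=500ms' >> master-hot-threads.txt
    sleep 10
  done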


Thanks! This is exactly the behavior we are seeing! Regarding the master, it isn't really a hang: the memory keeps increasing and the master becomes slower until it crashes with an OOM.

Ok I see, an OOM isn't how we want to respond at all. I opened #50670 to track this.


Thanks! Here are the hot threads from the master:

::: {es-zzzzz-zzz}{iTwxmD5eTemoNyzkNXk7sw}{dUc57FHhStywNoi_HCWtyg}{10.2.79.215}{10.2.79.215:9300}{m}
Hot threads at 2020-01-06T08:56:43.551Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

93.3% (466.6ms out of 500ms) cpu usage by thread 'elasticsearch[xxx-xxx-zzz][masterService#updateTask][T#1]'
 2/10 snapshots sharing following 31 elements
   app//com.carrotsearch.hppc.AbstractObjectCollection.toArray(AbstractObjectCollection.java:78)
   app//org.elasticsearch.index.mapper.FieldMapper$MultiFields.toXContent(FieldMapper.java:611)
   app//org.elasticsearch.index.mapper.FieldMapper.doXContentBody(FieldMapper.java:428)
   app//org.elasticsearch.index.mapper.TextFieldMapper.doXContentBody(TextFieldMapper.java:903)
   app//org.elasticsearch.index.mapper.FieldMapper.toXContent(FieldMapper.java:388)
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:560)
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:514)
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:560)
   app//org.elasticsearch.index.mapper.Mapping.toXContent(Mapping.java:131)
   app//org.elasticsearch.index.mapper.DocumentMapper.toXContent(DocumentMapper.java:340)
   app//org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:88)
   app//org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:183)
   app//org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:335)
   app//org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:524)
   app//org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:406)
   app//org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:339)
   app//org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:315)
   app//org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:238)
   app//org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702)
   app//org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324)
   app//org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219)
   app//org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73)
   app//org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
   app//org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
   app//org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
   app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703)
   app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
   app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
   java.base@13.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   java.base@13.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   java.base@13.0.1/java.lang.Thread.run(Thread.java:830)
 2/10 snapshots sharing following 28 elements
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:560)
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:514)
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:560)
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:514)
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:560)
   app//org.elasticsearch.index.mapper.Mapping.toXContent(Mapping.java:131)
   app//org.elasticsearch.index.mapper.DocumentMapper.toXContent(DocumentMapper.java:340)
   app//org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:88)
   app//org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:183)
   app//org.elasticsearch.index.mapper.DocumentMapper.merge(DocumentMapper.java:322)
   app//org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:461)
   app//org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:406)
   app//org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:339)
   app//org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:315)
   app//org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:238)
   app//org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702)
   app//org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324)
   app//org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219)
   app//org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73)
   app//org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
   app//org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
   app//org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
   app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703)
   app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
   app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
   java.base@13.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   java.base@13.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   java.base@13.0.1/java.lang.Thread.run(Thread.java:830)
 2/10 snapshots sharing following 25 elements
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:560)
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:514)
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:560)
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:514)
   app//org.elasticsearch.index.mapper.ObjectMapper.toXContent(ObjectMapper.java:560)
   app//org.elasticsearch.index.mapper.Mapping.toXContent(Mapping.java:131)
   app//org.elasticsearch.index.mapper.DocumentMapper.toXContent(DocumentMapper.java:340)
   app//org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:88)
   app//org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:183)
   app//org.elasticsearch.index.mapper.DocumentMapper.merge(DocumentMapper.java:322)
   app//org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:282)
   app//org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:238)
   app//org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702)
   app//org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324)
   app//org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219)
   app//org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73)
   app//org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
   app//org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
   app//org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
   app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703)
   app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
   app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
   java.base@13.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   java.base@13.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   java.base@13.0.1/java.lang.Thread.run(Thread.java:830)
 4/10 snapshots sharing following 14 elements
   app//org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:238)
   app//org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702)
   app//org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324)
   app//org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219)
   app//org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73)
   app//org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
   app//org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
   app//org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
   app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703)
   app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
   app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
   java.base@13.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   java.base@13.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   java.base@13.0.1/java.lang.Thread.run(Thread.java:830)

Downgrading to 7.1.1 solved the issue and is a workaround until the bug is fixed.
