High load after upgrading Elasticsearch/Kibana/Logstash from 7.5.1 to 7.9.3

Hello

I've upgraded an Elasticsearch server running on Ubuntu 18 from version 7.5.1 to 7.9.3. I run Elasticsearch, Kibana and Logstash on this server. The load was immediately much higher after the upgrade and this has persisted; it's about double what it was before the upgrade.

When I looked into Management -> Stack Monitoring in Kibana I could see that the search rate for the .kibana_4 index was ~1100/s, whereas the next busiest index was only 5/s, and the one after that 0.5/s. This is a small system: one node, no replicas, and about 150GB of indices. These are time-based indices (one per day); there are 722 indices and 1393 shards. The server has 12GB RAM and 4 vCPUs, and I've allocated a 4GB heap to Elasticsearch. There's only one person querying Kibana - me.

Based on this seemingly high search rate I shut down Kibana, and the load immediately fell to its pre-upgrade level. Cluster health appears green/A-OK whether Kibana is running or not. The .kibana_4 index is only 10MB (60 documents). After the upgrade I saw an error message in the log relating to the .kibana_4 index when starting Kibana; it had been stuck on that error for many minutes, so I deleted the index and restarted Kibana.
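For anyone wanting to reproduce this check without Stack Monitoring: the same per-index search counters can be sampled from the command line. This is a sketch, assuming Elasticsearch is listening on 127.0.0.1:9200; take two readings a few seconds apart and divide the delta by the interval to get queries per second.

```shell
# Cumulative query count for one index (look at "query_total")
curl -s "http://127.0.0.1:9200/.kibana_4/_stats/search?pretty" \
  | grep '"query_total"'

# All indices ranked by cumulative search count, as a table
curl -s "http://127.0.0.1:9200/_cat/indices?v&h=index,search.query_total&s=search.query_total:desc"
```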

Anybody have any ideas?

Simon

Welcome to our community! :smiley:

That shard count isn't really efficient; you should look to reduce it to around 25% of the current value, as it's likely wasting resources.
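For reference, the totals this advice refers to (assuming it's about the shard and index counts) can be confirmed with the _cat APIs; a minimal sketch, assuming a local node on port 9200:

```shell
# Count shards and indices on the cluster
curl -s "http://127.0.0.1:9200/_cat/shards" | wc -l
curl -s "http://127.0.0.1:9200/_cat/indices" | wc -l

# Per-node shard totals and disk usage
curl -s "http://127.0.0.1:9200/_cat/allocation?v"
```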

Is the load still happening? It could be Kibana running its migration tasks, so it might have dropped back down.

If not, can you post the output from this API - https://www.elastic.co/guide/en/elasticsearch/reference/7.10/cluster-nodes-hot-threads.html.

Hello Mark

The load is still as high. If I run top (and expose threads) I almost invariably see the G1 Conc#0 thread at the top of the list.

top - 10:14:17 up 35 days,  2:25,  2 users,  load average: 4.53, 4.54, 4.16
Threads: 508 total,   5 running, 441 sleeping,   0 stopped,   0 zombie
%Cpu(s): 28.9 us,  1.4 sy,  0.7 ni, 68.3 id,  0.7 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 12291772 total,   246944 free,  8336004 used,  3708824 buff/cache
KiB Swap:  2632700 total,  2137852 free,   494848 used.  3628236 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                      
24754 elastic+  20   0 89.169g 5.352g 515312 R 94.4 45.7   3296:34 G1 Conc#0                                                                                                    
30917 kibana    20   0 2569756 1.430g  19032 R 61.1 12.2 655:38.62 node                                                                                                         
24791 elastic+  20   0 89.169g 5.352g 515312 S 11.1 45.7 293:58.12 elasticsearch[s                                                                                              
30148 root      20   0   33984   3928   3136 R 11.1  0.0   0:00.04 top                                                                                                          
24755 elastic+  20   0 89.169g 5.352g 515312 S  5.6 45.7  49:50.16 GC Thread#1                                                                                                  
24756 elastic+  20   0 89.169g 5.352g 515312 S  5.6 45.7  49:50.58 GC Thread#2                                                                                                  
24757 elastic+  20   0 89.169g 5.352g 515312 S  5.6 45.7  49:48.88 GC Thread#3                                                                                                  
24760 elastic+  20   0 89.169g 5.352g 515312 S  5.6 45.7   6:19.73 VM Thread                                                                                                    
24799 elastic+  20   0 89.169g 5.352g 515312 S  5.6 45.7 174:45.71 elasticsearch[s                                                                                              
24801 elastic+  20   0 89.169g 5.352g 515312 S  5.6 45.7 174:35.15 elasticsearch[s                                                                                              
24804 elastic+  20   0 89.169g 5.352g 515312 S  5.6 45.7 260:16.81 elasticsearch[s                                                                                              
24817 elastic+  20   0 89.169g 5.352g 515312 R  5.6 45.7 270:43.37 elasticsearch[s                                                                                              
24818 elastic+  20   0 89.169g 5.352g 515312 R  5.6 45.7 264:30.39 elasticsearch[s                                                                                              
24854 elastic+  20   0 89.169g 5.352g 515312 S  5.6 45.7 149:44.66 elasticsearch[s                                                                                              
24865 elastic+  20   0 89.169g 5.352g 515312 S  5.6 45.7 174:09.70 elasticsearch[s                                                                                              
24960 elastic+  20   0 89.169g 5.352g 515312 S  5.6 45.7 174:34.67 elasticsearch[s
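Since the busiest thread in that top output is a G1 concurrent-mark thread, the GC pressure can also be confirmed from Elasticsearch itself rather than from top. A sketch (not from the original thread), assuming the node is on 127.0.0.1:9200:

```shell
# Heap usage and cumulative GC collection counts/times for every node;
# a heap_used_percent that stays near 75%+ and rapidly growing
# collection counts point at heap pressure rather than query load.
curl -s "http://127.0.0.1:9200/_nodes/stats/jvm?pretty" \
  | grep -E '"heap_used_percent"|"collection_count"|"collection_time_in_millis"'
```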

I deleted the .kibana_4 index as it appeared to contain images, HTML tags and settings (such as saved searches). I also deleted all the .kibana_N indices. On restarting Kibana it created a .kibana_2 index which is currently the one with the highest search rate.

I had already been using the hot threads API without much success. I don't glean anything from its output because (i) sometimes it returns nothing, (ii) it seems to return different things each time, and (iii) the output is, to me, quite inscrutable. Here are a couple of attempts.

Continued from previous post ...

curl -X GET "http://127.0.0.1:9200/_nodes/hot_threads?pretty"
::: {syd-elk-02}{sa1xm9R2Q96rQWMR1Jy0Eg}{Tawlyj87QNmdQDBjtV0UMQ}{127.0.0.1}{127.0.0.1:9300}{dilmrt}{ml.machine_memory=12586774528, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}
   Hot threads at 2021-01-17T23:22:03.045Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   14.7% (73.7ms out of 500ms) cpu usage by thread 'elasticsearch[syd-elk-02][search][T#5]'
     6/10 snapshots sharing following 34 elements
       app//org.apache.lucene.search.ConjunctionDISI.nextDoc(ConjunctionDISI.java:240)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:244)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:229)
       app//org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:56)
       app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:226)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:199)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:343)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:298)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:150)
       app//org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$22(IndicesService.java:1393)
       app//org.elasticsearch.indices.IndicesService$$Lambda$5085/0x0000000801812060.accept(Unknown Source)
       app//org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$23(IndicesService.java:1445)
       app//org.elasticsearch.indices.IndicesService$$Lambda$5086/0x00000008018127c0.get(Unknown Source)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:176)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:159)
       app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433)
       app//org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:125)
       app//org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1451)
       app//org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1390)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:360)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:435)
       app//org.elasticsearch.search.SearchService.access$200(SearchService.java:136)
       app//org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:396)
       app//org.elasticsearch.search.SearchService$2$$Lambda$5064/0x0000000801808040.get(Unknown Source)
       app//org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:412)
       app//org.elasticsearch.search.SearchService$$Lambda$5065/0x0000000801808268.run(Unknown Source)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15/java.lang.Thread.run(Thread.java:832)
     4/10 snapshots sharing following 2 elements
       java.base@15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15/java.lang.Thread.run(Thread.java:832)
   
    8.5% (42.3ms out of 500ms) cpu usage by thread 'elasticsearch[syd-elk-02][transport_worker][T#1]'
     3/10 snapshots sharing following 60 elements
       app//org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:115)
       org.elasticsearch.xpack.security.rest.SecurityRestFilter.handleRequest(SecurityRestFilter.java:81)
       app//org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:236)
       app//org.elasticsearch.rest.RestController.tryAllHandlers(RestController.java:318)
       app//org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:176)
       app//org.elasticsearch.http.AbstractHttpServerTransport.dispatchRequest(AbstractHttpServerTransport.java:318)
       app//org.elasticsearch.http.AbstractHttpServerTransport.handleIncomingRequest(AbstractHttpServerTransport.java:372)
       app//org.elasticsearch.http.AbstractHttpServerTransport.incomingRequest(AbstractHttpServerTransport.java:308)
       org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:42)
       org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:28)
       io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       org.elasticsearch.http.netty4.Netty4HttpPipeliningHandler.channelRead(Netty4HttpPipeliningHandler.java:58)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
       io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
       io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
       io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
       io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
       io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615)
       io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578)
       io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
       io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       java.base@15/java.lang.Thread.run(Thread.java:832)
     4/10 snapshots sharing following 9 elements
       java.base@15/sun.nio.ch.EPoll.wait(Native Method)
       java.base@15/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
       java.base@15/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:129)
       java.base@15/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:146)
       io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:803)
       io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
       io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       java.base@15/java.lang.Thread.run(Thread.java:832)
   
    6.5% (32.5ms out of 500ms) cpu usage by thread 'elasticsearch[syd-elk-02][transport_worker][T#4]'
     6/10 snapshots sharing following 3 elements
       io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       java.base@15/java.lang.Thread.run(Thread.java:832)

curl -X GET "http://127.0.0.1:9200/_nodes/hot_threads?pretty"
::: {syd-elk-02}{sa1xm9R2Q96rQWMR1Jy0Eg}{Tawlyj87QNmdQDBjtV0UMQ}{127.0.0.1}{127.0.0.1:9300}{dilmrt}{ml.machine_memory=12586774528, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}
   Hot threads at 2021-01-17T23:26:56.850Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    7.6% (37.7ms out of 500ms) cpu usage by thread 'elasticsearch[syd-elk-02][transport_worker][T#1]'
     10/10 snapshots sharing following 3 elements
       io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       java.base@15/java.lang.Thread.run(Thread.java:832)

I realise that my shard allocation may be less than optimal, but it was presumably arranged the same way before the upgrade without the current issue arising. Once the current issue was resolved I was going to roll up all the daily indices into single monthly indices (using _reindex) in order to reduce the shard count. I'd experimented with this on one month's data and it did reduce the shard count, but it did not reduce the load appreciably, nor did it improve search performance for that month's data.
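The daily-to-monthly rollup described above can be sketched with the _reindex API. The index names below (logstash-2020.11.*) are hypothetical; adjust the pattern to your own naming scheme.

```shell
# Copy all of one month's daily indices into a single monthly index.
# Note the dest name doesn't match the source wildcard, so the new
# index isn't re-read as its own source.
curl -s -X POST "http://127.0.0.1:9200/_reindex?wait_for_completion=false" \
  -H 'Content-Type: application/json' -d'
{
  "source": { "index": "logstash-2020.11.*" },
  "dest":   { "index": "logstash-2020.11" }
}'
# This returns a task ID you can poll with GET _tasks/<task_id>.
# Once the copy is verified, delete the daily indices to actually
# free their shards - reindexing alone doesn't remove them.
```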

Simon

Hello

I've done some reindexing operations to take multiple, small daily indices and combine them into weekly or monthly indices, as I understand this is more efficient. I've gone from 722 indices and 1393 shards to 83 indices and 147 shards, but have still retained all of the data. The load average is much improved, so this reindexing work (or writing to weekly/monthly indices in the first place) is something I'll take forward in the system.

However: the 5-minute load average before the upgrade was 1.2 and now it's 1.7. So whilst I've seen an improvement, there's still been a 40% increase in load as a consequence of this upgrade, and I don't think that's progress. As a sysadmin, when I see performance suddenly take a hit as a result of an upgrade it's not something that sits well with me and, IMHO, it shouldn't sit well with the product developers either. To just lose capacity for no apparent reason is troubling. And bear in mind that the product was quite happy at a 1.2 load average with my inefficiently arranged indices, so strictly speaking the performance hit is presumably a lot worse than 40%.

Simon

p.s. These screenshots show the impact of my reindexing work on the system.



Subsequent to my last post (where I'd been busily consolidating daily indices into weekly or monthly indices and seeing some improvement in performance, but not back to the load levels prior to the upgrade) I can report that after a few days of me leaving the system alone the load suddenly dropped quite dramatically. Now the load is significantly lower than it was before the upgrade.

So the reindexing (merging/consolidating) of multiple indices into just a few appears to have been successful. And I can only assume that, after that work, some background task must have been running (post-upgrade or post-reindex) that finally finished and reduced the load to its current levels.

The load on 7.5.1, prior to the upgrade, was ~1.2. Now the load (the last drop at the end of the graph in Week 04) is ~0.5. I'll take that kind of outcome any day. But what sits uncomfortably with me is the opaqueness of what just occurred: the lack of insight into it.

[screenshot: load average graph]

Simon

Did the segment count keep dropping during the last week?

Yes and no :o)

The segment count rose from ~2000 to ~2100 between Jan 23rd and Jan 28th, then dropped sharply back to ~2000 on the 28th. That final drop seems to coincide with the load reduction. I'm interested to know what was happening in the background, and how I could have diagnosed and tracked that behaviour at the time; otherwise it's a mysterious black box. Also, whilst the segment count changed, it was hardly an order-of-magnitude change (a delta of ~8% in segments), yet it coincided with a ~60% drop in load average.

Simon

Merging is computationally intensive.
Segments usually increase due to indexing changes in the index.
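Merge activity of that kind can be watched directly, which might have made the background work less of a black box. A sketch, not from the original thread; the index name in the last command is hypothetical:

```shell
# Per-index segment counts and in-flight merges
curl -s "http://127.0.0.1:9200/_cat/indices?v&h=index,segments.count,merges.current"

# Per-shard segment detail, if a specific index looks suspect
curl -s "http://127.0.0.1:9200/_cat/segments?v"

# Optionally merge an older, no-longer-written index down to one
# segment; only do this for read-only indices, as it's expensive.
curl -s -X POST "http://127.0.0.1:9200/logstash-2020.11/_forcemerge?max_num_segments=1"
```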

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.