I've upgraded an Elasticsearch server running on Ubuntu 18 from version 7.5.1 to 7.9.3. I use Elasticsearch, Kibana and Logstash on this server. The load was immediately much higher after the upgrade and has stayed at roughly double its pre-upgrade level.
When I looked at Management -> Stack Monitoring in Kibana I could see that the search rate for the .kibana_4 index was ~1100/s, whereas the next busiest index was only 5/s and the one after that 0.5/s. This is a small system: one node, no replicas and about 150GB of indices. These are time-based indices (one per day); there are 722 indices and 1393 shards. The server has 12GB RAM and 4 vCPUs, and I've allocated a 4GB heap to Elasticsearch. There's only one person querying Kibana - me.
Based on this seemingly high search rate I shut down Kibana, and the load immediately fell back to its pre-upgrade level. Cluster health shows green whether Kibana is running or not. The .kibana_4 index is only 10MB (60 documents). After the upgrade, starting Kibana produced an error in the log relating to the .kibana_4 index, and Kibana sat stuck on that error for many minutes, so I deleted the index and restarted Kibana.
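For what it's worth, the per-index search counters behind that Stack Monitoring chart can also be read straight from the index stats API; sampling query_total twice and dividing the difference by the interval gives the same rate. Something along these lines (the .kibana* pattern just matches the saved-objects indices):
# Cumulative search counters for the Kibana saved-objects indices
curl -X GET "http://127.0.0.1:9200/.kibana*/_stats/search?pretty"
# Take two samples, e.g. 60s apart; the increase in
# indices.<name>.total.search.query_total divided by 60 is the per-index search rate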
I deleted the .kibana_4 index as it appeared to contain only images, HTML tags and settings (such as saved searches). I also deleted all the other .kibana_N indices. On restarting, Kibana created a .kibana_2 index, which is currently the one with the highest search rate.
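For the record, the clean-up was nothing more elaborate than listing the saved-objects indices and deleting them by name, roughly as follows:
# List the Kibana saved-objects indices and their sizes
curl -X GET "http://127.0.0.1:9200/_cat/indices/.kibana*?v"
# Delete the problem index; Kibana recreates a fresh one on restart,
# but any saved objects (searches, dashboards, etc.) stored in it are lost
curl -X DELETE "http://127.0.0.1:9200/.kibana_4"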
I had already been using the hot threads API without much success. I don't glean anything from it because (i) sometimes it returns nothing, (ii) it seems to return different things each time, and (iii) the output is, to me, quite inscrutable. Here are a couple of attempts.
curl -X GET "http://127.0.0.1:9200/_nodes/hot_threads?pretty"
::: {syd-elk-02}{sa1xm9R2Q96rQWMR1Jy0Eg}{Tawlyj87QNmdQDBjtV0UMQ}{127.0.0.1}{127.0.0.1:9300}{dilmrt}{ml.machine_memory=12586774528, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}
Hot threads at 2021-01-17T23:22:03.045Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
14.7% (73.7ms out of 500ms) cpu usage by thread 'elasticsearch[syd-elk-02][search][T#5]'
6/10 snapshots sharing following 34 elements
app//org.apache.lucene.search.ConjunctionDISI.nextDoc(ConjunctionDISI.java:240)
app//org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:244)
app//org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:229)
app//org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:56)
app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:226)
app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:199)
app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:343)
app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:298)
app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:150)
app//org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$22(IndicesService.java:1393)
app//org.elasticsearch.indices.IndicesService$$Lambda$5085/0x0000000801812060.accept(Unknown Source)
app//org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$23(IndicesService.java:1445)
app//org.elasticsearch.indices.IndicesService$$Lambda$5086/0x00000008018127c0.get(Unknown Source)
app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:176)
app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:159)
app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433)
app//org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:125)
app//org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1451)
app//org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1390)
app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:360)
app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:435)
app//org.elasticsearch.search.SearchService.access$200(SearchService.java:136)
app//org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:396)
app//org.elasticsearch.search.SearchService$2$$Lambda$5064/0x0000000801808040.get(Unknown Source)
app//org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:412)
app//org.elasticsearch.search.SearchService$$Lambda$5065/0x0000000801808268.run(Unknown Source)
app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.base@15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
java.base@15/java.lang.Thread.run(Thread.java:832)
4/10 snapshots sharing following 2 elements
java.base@15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
java.base@15/java.lang.Thread.run(Thread.java:832)
8.5% (42.3ms out of 500ms) cpu usage by thread 'elasticsearch[syd-elk-02][transport_worker][T#1]'
3/10 snapshots sharing following 60 elements
app//org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:115)
org.elasticsearch.xpack.security.rest.SecurityRestFilter.handleRequest(SecurityRestFilter.java:81)
app//org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:236)
app//org.elasticsearch.rest.RestController.tryAllHandlers(RestController.java:318)
app//org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:176)
app//org.elasticsearch.http.AbstractHttpServerTransport.dispatchRequest(AbstractHttpServerTransport.java:318)
app//org.elasticsearch.http.AbstractHttpServerTransport.handleIncomingRequest(AbstractHttpServerTransport.java:372)
app//org.elasticsearch.http.AbstractHttpServerTransport.incomingRequest(AbstractHttpServerTransport.java:308)
org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:42)
org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:28)
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
org.elasticsearch.http.netty4.Netty4HttpPipeliningHandler.channelRead(Netty4HttpPipeliningHandler.java:58)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615)
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578)
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
java.base@15/java.lang.Thread.run(Thread.java:832)
4/10 snapshots sharing following 9 elements
java.base@15/sun.nio.ch.EPoll.wait(Native Method)
java.base@15/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
java.base@15/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:129)
java.base@15/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:146)
io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:803)
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
java.base@15/java.lang.Thread.run(Thread.java:832)
6.5% (32.5ms out of 500ms) cpu usage by thread 'elasticsearch[syd-elk-02][transport_worker][T#4]'
6/10 snapshots sharing following 3 elements
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
java.base@15/java.lang.Thread.run(Thread.java:832)
curl -X GET "http://127.0.0.1:9200/_nodes/hot_threads?pretty"
::: {syd-elk-02}{sa1xm9R2Q96rQWMR1Jy0Eg}{Tawlyj87QNmdQDBjtV0UMQ}{127.0.0.1}{127.0.0.1:9300}{dilmrt}{ml.machine_memory=12586774528, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}
Hot threads at 2021-01-17T23:26:56.850Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
7.6% (37.7ms out of 500ms) cpu usage by thread 'elasticsearch[syd-elk-02][transport_worker][T#1]'
10/10 snapshots sharing following 3 elements
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
java.base@15/java.lang.Thread.run(Thread.java:832)
I realise that my shard allocation may be less than optimal, but it was presumably the same before the upgrade and I wasn't experiencing the current issue then. Once the current issue was resolved I was going to roll up the daily indices into monthly indices (using _reindex) in order to minimise the shard count. I'd experimented with this on one month and it had that effect, but it did not reduce the load appreciably, nor did it improve search performance for that month's data.
I've since done some reindexing operations to combine the many small daily indices into weekly or monthly indices, as I understand this is more efficient. I've gone from 722 indices and 1393 shards to 83 indices and 147 shards, while still retaining all of the data. The load average is much improved, so this reindexing work (or writing to weekly/monthly indices in the first place) is something I'll take forward on this system.
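For reference, each consolidation was just a _reindex call along these lines, followed by deleting the daily source indices once the task had completed (the logstash-* names here are only placeholders for whatever the daily indices are called):
# Merge a month's worth of daily indices into one monthly index;
# wait_for_completion=false returns a task ID that can be polled via GET _tasks/<id>
curl -X POST "http://127.0.0.1:9200/_reindex?wait_for_completion=false&pretty" \
  -H 'Content-Type: application/json' -d'
{
  "source": { "index": "logstash-2020.12.*" },
  "dest":   { "index": "logstash-2020.12" }
}'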
However. The 5 minute load average before the upgrade was 1.2 and now it's 1.7. So whilst I've seen an improvement, there has still been a ~40% increase in load as a consequence of this upgrade. I don't think that's progress. As a sysadmin, when I see performance suddenly take a hit as a result of an upgrade it's not something that sits well with me and, IMHO, it shouldn't sit well with the product developers either. Losing capacity for no apparent reason is troubling. And bear in mind that the product was quite happy on a 1.2 load average previously, with my inefficiently arranged indices, so the performance hit is presumably, strictly speaking, rather worse than 40%.
Simon
p.s. These screenshots show the impact of my reindexing work on the system.
Subsequent to my last post (where I'd been busily consolidating daily indices into weekly or monthly indices and seeing some improvement in performance, though not back to pre-upgrade load levels), I can report that after a few days of leaving the system alone the load suddenly dropped quite dramatically. The load is now significantly lower than it was before the upgrade.
So the reindexing (merging/consolidating) of many indices into just a few appears to have been successful. And I can only assume that, after that work, some background task (post-upgrade or post-reindex) was still running and, when it finally finished, the load dropped to its current levels.
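Had I thought to look while the load was still elevated, the tasks API would at least have shown whether something like a reindex or a force-merge request was still in flight (ordinary background Lucene segment merges don't appear here, though - see the node stats below for those):
# Long-running cluster tasks, e.g. reindex or force-merge requests
curl -X GET "http://127.0.0.1:9200/_cat/tasks?v"
curl -X GET "http://127.0.0.1:9200/_tasks?detailed=true&actions=*reindex*,*forcemerge*&pretty"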
The load on 7.5.1, prior to the upgrade, was ~1.2. Now the load (the last drop at the end of the graph in Week 04) is ~0.5. I'll take that kind of outcome any day. But what sits uncomfortably with me is the opaqueness of what just occurred - the lack of insight.
The segment count rose from ~2000 to ~2100 between Jan 23rd and Jan 28th, then dropped sharply back to ~2000 on the 28th. This final drop seems to coincide with the load reduction. I'm interested to know what was happening in the background and how I could have diagnosed and tracked that behaviour at the time; otherwise it's a mysterious black box. Also, whilst the segment count changed, that wasn't an order-of-magnitude change (a delta of ~8% in segments), yet it coincided with a drop of ~60% in load average.
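In case it helps anyone else, these are the calls I'd now keep an eye on to track that sort of background segment and merge activity:
# Per-shard segment listing (segment count, size, memory)
curl -X GET "http://127.0.0.1:9200/_cat/segments?v"
# Node-level merge and segment statistics; a non-zero merges.current
# means background merging is happening right now
curl -X GET "http://127.0.0.1:9200/_nodes/stats/indices/merges,segments?pretty"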