Performance tuning advisory needed


Hi all,

We came from Nagios Log Server with a single node. It crashed quite often, but it was much faster than our current single-node instance on pure ELK 5.1.2.

Once we migrated, we also changed our index / shard allocation. With Log Server it was only possible to have one index per day, which held all types, with 5 shards per index.

With ELK 5.1.2 we changed that. During development everything seemed to be fine, but now we encounter big performance issues. Dashboards often run into the 30 s timeout.

So I need your thoughts and advice on tuning the configuration, and on whether and how I should upgrade the underlying VM. If possible we want to stay on a single machine, because the project is short on budget :frowning:

shard allocation:

First some information about our current shard allocation:

health status index                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   tux-prod-2017.03.30              C-vON4hRQBKuGyyEADhhaw   5   1    8091818            0        9gb            9gb
yellow open   perf-staging-2017.03.31          9xj1n8s2SoydFP7Nf2sxkQ   3   1    1775777            0   1000.2mb       1000.2mb
yellow open   tux-staging-2017.03.31           fq1l1T-vQPSkInwL7mLKiw   5   1     318983            0    205.1mb        205.1mb
yellow open   perf-prod-2017.03.30             LozcJsYCQf28OwbS569_8Q   3   1    1137168            0    695.5mb        695.5mb
yellow open   perf-staging-2017.03.30          rtz5dZ8qR2KyZGnPM4iAbQ   3   1    1791326            0   1015.1mb       1015.1mb
yellow open   other-prod-2017.03.31            i2mumZn3TyKQV1RtM9XzHA   3   1     242674            0      111mb          111mb
yellow open   other-staging-2017.03.30         asFS-clSSvacRsUxbcW51A   3   1      19028            0     26.4mb         26.4mb
yellow open   perf-prod-2017.03.31             k1FYSgDEScia9iN88hxujw   3   1    1145884            0    697.9mb        697.9mb
yellow open   perf-no_stage_defined-2017.03.30 4i42u1CsREmOZPeKPRSiXw   3   1     126108            0    108.2mb        108.2mb
yellow open   tux-staging-2017.03.30           WKGfCXseR5KOwhF67UlZPg   5   1     351193            0    236.9mb        236.9mb
yellow open   perf-no_stage_defined-2017.03.31 PAXTC2FXQcynRwZwy1zkrA   3   1     127356            0    111.9mb        111.9mb
yellow open   tux-prod-2017.03.31              HX4IBwozStCtecQO8NT49A   5   1    7996508            0      8.9gb          8.9gb
yellow open   other-staging-2017.03.31         rOep5GTET5KBRoWRVyTIGQ   3   1      18229            0     23.9mb         23.9mb
yellow open   other-prod-2017.03.30            uwk2AEt1RRGKF6YTBpHIWQ   3   1     241231            0    108.9mb        108.9mb

We need to keep the data of the last 40 days. Staging indices can be deleted after 14 days if needed.
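For the 14-day staging retention, an Elasticsearch Curator action file along these lines should work (a sketch only; the regex pattern is assumed from the index names in the listing above):

```yaml
# Sketch of a Curator action file: delete staging indices older than 14 days.
actions:
  1:
    action: delete_indices
    description: "Delete *-staging-* indices older than 14 days"
    options:
      ignore_empty_list: True
    filters:
    - filtertype: pattern
      kind: regex
      value: '^(tux|perf|other)-staging-.*$'
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 14
```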

What do you think about the shard allocation? Too few / too many?
On a normal day, if our system (the log producer) runs smoothly, we see about 800k events per hour at peak. If the system has problems, it may be up to 1.1 million events per hour.

vm resources

  • we are using a single node on a VM:
      • 7 CPUs
      • 28 GB RAM
      • 500 GB storage
  • Heap settings:
      • Elasticsearch: 14 GB
      • Logstash: 1 GB

So maybe something is misconfigured here?
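For reference, the heap sizes are set in the respective jvm.options files; ours look roughly like this (min and max set equal to avoid heap resizing):

```
# Elasticsearch config/jvm.options
-Xms14g
-Xmx14g

# Logstash config/jvm.options
-Xms1g
-Xmx1g
```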

use cases
Our use cases are:

  • index logfiles to make them accessible via Kibana
  • use multiple dashboards in Kibana to monitor and analyze the logfiles of our systems
  • in our dashboards we use aggregations like min / max / avg / percentiles / terms / filters
  • automatic monitoring of our system every minute, analyzing the last few minutes of data via Elasticsearch aggregations
  • we have about 20 probes running each minute

users

  • we have about 5 to 15 concurrent users on the system

Any help is really appreciated.
Thanks, Andreas


Can anyone help please?

(Christian Dahlqvist) #3

I think it looks like you have too many shards. Most, if not all, of the indices you have listed could be set up with a single primary shard. Have you monitored the node when queries are slow to see what is limiting you? Is the CPU saturated, or is it perhaps disk I/O?
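If you go that route, a single primary shard (and, on a single node, zero replicas, so the cluster also goes green) can be applied to future daily indices via an index template. A sketch for the 5.x template API, with the template name and index pattern assumed:

```shell
# Hypothetical 5.x index template: new daily indices matching the pattern
# get 1 primary shard and 0 replicas (single-node cluster).
curl -XPUT 'localhost:9200/_template/daily-defaults' -H 'Content-Type: application/json' -d'
{
  "template": "*-2017.*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'
```

Note that a template only affects indices created after it is added; existing indices keep their shard count.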


I took a look at top:
I/O wait goes up to 30-80%.
Load increases from 1 to 12, and Elasticsearch goes up to 500% CPU.
CPU usage seems higher if I retry the query later when it is cached, while I/O wait is reduced then.

I got these errors in the ES log:

org.elasticsearch.transport.RemoteTransportException: [node-1][][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.transport.TransportService$6@37cbcdc7 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@ba2155f[Running, pool size = 13, active threads = 13, queued tasks = 1000, completed tasks = 20401539]]

And I got many lines about garbage collection in logs:

[2017-04-12T15:05:40,301][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][706034] overhead, spent [370ms] collecting in the last [1s]
[2017-04-12T15:05:58,323][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][706051] overhead, spent [303ms] collecting in the last [1s]
[2017-04-12T15:06:03,412][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][706056] overhead, spent [291ms] collecting in the last [1s]
[2017-04-12T15:06:09,901][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][706062] overhead, spent [334ms] collecting in the last [1s]
[2017-04-12T15:06:19,401][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][706071] overhead, spent [268ms] collecting in the last [1s]
[2017-04-12T15:06:26,943][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][706078] overhead, spent [379ms] collecting in the last [1.3s]


OK, I will try to reduce the shard count for the next days' indices. I will go to 1 shard per index; tux-prod will get 2 shards.

I'll keep you updated on whether / how the picture changes.

(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.