Performance tuning advisory needed


Hi all,

We came from Nagios Log Server with a single node. It crashed quite often, but it was much faster than our current single-node instance on pure ELK 5.1.2.

Once we migrated, we also changed our index / shard allocation. With Log Server it was only possible to have one index per day, which held all types, with 5 shards per index.

With ELK 5.1.2 we changed that. During development everything seemed to be fine, but now we encounter big performance issues. Dashboards often run into the 30 s timeout.

So I need your thoughts and advice on tuning the configuration, and on whether and how I should upgrade the underlying VM. If possible we want to stay on a single machine, because the project is short on budget :frowning:

shard allocation:

First some information about our current shard allocation:

health status index                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   tux-prod-2017.03.30              C-vON4hRQBKuGyyEADhhaw   5   1    8091818            0        9gb            9gb
yellow open   perf-staging-2017.03.31          9xj1n8s2SoydFP7Nf2sxkQ   3   1    1775777            0   1000.2mb       1000.2mb
yellow open   tux-staging-2017.03.31           fq1l1T-vQPSkInwL7mLKiw   5   1     318983            0    205.1mb        205.1mb
yellow open   perf-prod-2017.03.30             LozcJsYCQf28OwbS569_8Q   3   1    1137168            0    695.5mb        695.5mb
yellow open   perf-staging-2017.03.30          rtz5dZ8qR2KyZGnPM4iAbQ   3   1    1791326            0   1015.1mb       1015.1mb
yellow open   other-prod-2017.03.31            i2mumZn3TyKQV1RtM9XzHA   3   1     242674            0      111mb          111mb
yellow open   other-staging-2017.03.30         asFS-clSSvacRsUxbcW51A   3   1      19028            0     26.4mb         26.4mb
yellow open   perf-prod-2017.03.31             k1FYSgDEScia9iN88hxujw   3   1    1145884            0    697.9mb        697.9mb
yellow open   perf-no_stage_defined-2017.03.30 4i42u1CsREmOZPeKPRSiXw   3   1     126108            0    108.2mb        108.2mb
yellow open   tux-staging-2017.03.30           WKGfCXseR5KOwhF67UlZPg   5   1     351193            0    236.9mb        236.9mb
yellow open   perf-no_stage_defined-2017.03.31 PAXTC2FXQcynRwZwy1zkrA   3   1     127356            0    111.9mb        111.9mb
yellow open   tux-prod-2017.03.31              HX4IBwozStCtecQO8NT49A   5   1    7996508            0      8.9gb          8.9gb
yellow open   other-staging-2017.03.31         rOep5GTET5KBRoWRVyTIGQ   3   1      18229            0     23.9mb         23.9mb
yellow open   other-prod-2017.03.30            uwk2AEt1RRGKF6YTBpHIWQ   3   1     241231            0    108.9mb        108.9mb

We need to keep the data of the last 40 days. Staging indices can be deleted after 14 days if needed.
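For the 14-day staging retention, an Elasticsearch Curator action file along these lines should work (a sketch only; the regex pattern is assumed from the index names in the listing above):

```yaml
# Sketch of a Curator action file: delete staging indices older than 14 days.
actions:
  1:
    action: delete_indices
    description: "Delete *-staging-* indices older than 14 days"
    options:
      ignore_empty_list: True
    filters:
    - filtertype: pattern
      kind: regex
      value: '^(tux|perf|other)-staging-.*$'
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 14
```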

What do you think about the shard allocation? Too few / too many?
On a normal day, if our system (the log producer) runs smoothly, we see about 800k events per hour at peak. If the system has problems, it may be up to 1.1 million events per hour.

vm resources

  • we are using a single node on a VM:
      • 7 CPUs
      • 28 GB RAM
      • 500 GB storage
  • Heap settings:
      • Elasticsearch: 14 GB
      • Logstash: 1 GB

So maybe something is misconfigured here?
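For reference, the heap sizes are set in the respective jvm.options files; ours look roughly like this (min and max set equal to avoid heap resizing):

```
# Elasticsearch config/jvm.options
-Xms14g
-Xmx14g

# Logstash config/jvm.options
-Xms1g
-Xmx1g
```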

use cases
Our use cases are:

  • index logfiles to make them accessible via Kibana
  • use multiple dashboards in Kibana to monitor and analyze the logfiles of our systems
  • in our dashboards we use aggregations like min / max / avg / percentiles / terms / filters
  • automatic monitoring of our system every minute, analyzing the last few minutes of data via Elasticsearch aggregations
  • we have about 20 probes running each minute

users

  • we have about 5 to 15 concurrent users on the system

Any help is really appreciated.
Thanks, Andreas


Can anyone help please?

(Christian Dahlqvist) #3

I think it looks like you have too many shards. Most, if not all, of the indices you have listed could be set up with a single primary shard. Have you monitored the node when queries are slow to see what is limiting you? Is the CPU saturated, or is it perhaps disk I/O?
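If you go that route, a single primary shard (and, on a single node, zero replicas, so the cluster also goes green) can be applied to future daily indices via an index template. A sketch for the 5.x template API, with the template name and index pattern assumed:

```shell
# Hypothetical 5.x index template: new daily indices matching the pattern
# get 1 primary shard and 0 replicas (single-node cluster).
curl -XPUT 'localhost:9200/_template/daily-defaults' -H 'Content-Type: application/json' -d'
{
  "template": "*-2017.*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'
```

Note that a template only affects indices created after it is added; existing indices keep their shard count.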


I took a look at top:
I/O wait goes up to 30-80%.
Load increases from 1 to 12, and Elasticsearch goes up to 500% CPU.
CPU usage seems higher if I retry the query later when it is cached, while I/O wait is reduced then.

I got these errors in the ES log:

org.elasticsearch.transport.RemoteTransportException: [node-1][][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.transport.TransportService$6@37cbcdc7 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@ba2155f[Running, pool size = 13, active threads = 13, queued tasks = 1000, completed tasks = 20401539]]

And I got many lines about garbage collection in logs:

[2017-04-12T15:05:40,301][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][706034] overhead, spent [370ms] collecting in the last [1s]
[2017-04-12T15:05:58,323][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][706051] overhead, spent [303ms] collecting in the last [1s]
[2017-04-12T15:06:03,412][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][706056] overhead, spent [291ms] collecting in the last [1s]
[2017-04-12T15:06:09,901][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][706062] overhead, spent [334ms] collecting in the last [1s]
[2017-04-12T15:06:19,401][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][706071] overhead, spent [268ms] collecting in the last [1s]
[2017-04-12T15:06:26,943][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][706078] overhead, spent [379ms] collecting in the last [1.3s]


OK, I will try to reduce the shard count for the next days' indices. I will go to 1 shard per index; tux-prod will get 2 shards.

I'll keep you updated on whether / how the picture changes.

(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.