We came from Nagios Log Server with a single node. It crashed quite often, but it responded much faster than our current single-node instance on pure ELK 5.1.2.
When we migrated we also changed our index / shard allocation. With Log Server it was only possible to have one index per day holding all types, with 5 shards per index.
With ELK 5.1.2 we changed that. During development everything seemed fine, but now we are running into serious performance problems: dashboards frequently hit the 30 s timeout.
So I need your thoughts and advice on tuning the configuration, and on whether and how I should upgrade the underlying VM. If possible we want to stay on a single machine, because the project is short on budget.
First, some information about our current shard allocation (output of _cat/indices?v):
health status index                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   tux-prod-2017.03.30              C-vON4hRQBKuGyyEADhhaw   5   1    8091818            0        9gb            9gb
yellow open   perf-staging-2017.03.31          9xj1n8s2SoydFP7Nf2sxkQ   3   1    1775777            0   1000.2mb       1000.2mb
yellow open   tux-staging-2017.03.31           fq1l1T-vQPSkInwL7mLKiw   5   1     318983            0    205.1mb        205.1mb
yellow open   perf-prod-2017.03.30             LozcJsYCQf28OwbS569_8Q   3   1    1137168            0    695.5mb        695.5mb
yellow open   perf-staging-2017.03.30          rtz5dZ8qR2KyZGnPM4iAbQ   3   1    1791326            0   1015.1mb       1015.1mb
yellow open   other-prod-2017.03.31            i2mumZn3TyKQV1RtM9XzHA   3   1     242674            0      111mb          111mb
yellow open   other-staging-2017.03.30         asFS-clSSvacRsUxbcW51A   3   1      19028            0     26.4mb         26.4mb
yellow open   perf-prod-2017.03.31             k1FYSgDEScia9iN88hxujw   3   1    1145884            0    697.9mb        697.9mb
yellow open   perf-no_stage_defined-2017.03.30 4i42u1CsREmOZPeKPRSiXw   3   1     126108            0    108.2mb        108.2mb
yellow open   tux-staging-2017.03.30           WKGfCXseR5KOwhF67UlZPg   5   1     351193            0    236.9mb        236.9mb
yellow open   perf-no_stage_defined-2017.03.31 PAXTC2FXQcynRwZwy1zkrA   3   1     127356            0    111.9mb        111.9mb
yellow open   tux-prod-2017.03.31              HX4IBwozStCtecQO8NT49A   5   1    7996508            0      8.9gb          8.9gb
yellow open   other-staging-2017.03.31         rOep5GTET5KBRoWRVyTIGQ   3   1      18229            0     23.9mb         23.9mb
yellow open   other-prod-2017.03.30            uwk2AEt1RRGKF6YTBpHIWQ   3   1     241231            0    108.9mb        108.9mb
We need to keep data for the last 40 days. Staging indices can be deleted after 14 days if needed.
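For the cleanup we are thinking about running Elasticsearch Curator from cron once a day. A minimal sketch of the delete action, assuming the date-suffixed index names shown above (a second action with unit_count: 14 would cover the *-staging-* indices):

```yaml
# Curator action file (sketch): delete prod indices older than 40 days.
actions:
  1:
    action: delete_indices
    description: Drop prod indices older than 40 days
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: regex
        value: '^(tux|perf|other)-prod-.*$'
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 40
```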
What do you think about the shard allocation? Too few shards or too many?
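For context, this is roughly how we would pin shard and replica counts per index pattern via a template if the current numbers turn out to be wrong. The template name and the values below are placeholders for illustration, not what is deployed right now:

```
curl -XPUT 'localhost:9200/_template/tux-prod' -H 'Content-Type: application/json' -d '
{
  "template": "tux-prod-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'
```

(On a single node the replica shards can never be allocated anyway, which is presumably why everything above shows yellow.)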
On a normal day, if our system (the log producer) runs smoothly, we see a peak of about 800k events per hour (roughly 220 events/s). If the system has problems, we may see up to 1.1 million events per hour (roughly 300 events/s).
We are using a single node on a VM:
- 28 GB RAM
- Elasticsearch: 14 GB
- Logstash: 1 GB
So maybe something is misconfigured here?
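Assuming the 14 GB / 1 GB figures above refer to the JVM heap sizes, the relevant settings look roughly like this (paths as in the default 5.x packages; a sketch, not a copy of our actual files):

```
# /etc/elasticsearch/jvm.options
-Xms14g
-Xmx14g

# /etc/logstash/jvm.options
-Xms1g
-Xmx1g
```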
Our use cases are:
- index log files to make them accessible via Kibana
- use multiple dashboards in Kibana to monitor and analyze the log files of our systems
- in our dashboards we use aggregations like min / max / avg / percentiles / terms / filters
- automatic monitoring of our system every minute, based on Elasticsearch aggregations over the last few minutes (see the query sketch after this list)
- we have about 20 probes running each minute
- we have about 5 to 15 concurrent users on the system
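To give an idea of what one of those probes does, here is a simplified sketch of the kind of query each probe sends every minute. The index pattern, the field name response_time and the 5-minute window are made up for illustration:

```
curl -XGET 'localhost:9200/perf-prod-*/_search' -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "range": { "@timestamp": { "gte": "now-5m", "lt": "now" } }
      }
    }
  },
  "aggs": {
    "response_time_stats": { "stats": { "field": "response_time" } },
    "response_time_pct":   { "percentiles": { "field": "response_time", "percents": [95, 99] } }
  }
}'
```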
Any help is really appreciated.