Hello everyone. I am new to the ELK world and this is my first post so I hope you will go easy on me.
I am currently working on a reporting system for one of our in-house projects. The problem is that we have lots and lots of data to handle. To give you an example, this is one of our hourly indices:
green open node-requests-2019-12-11-22 asdasdASFasf3214234sdf 1 1 52593609 0 20.8gb 10.4gb
We have around 52 million records per hour, which is about 1248 million (1.25 billion) records per day. So we needed something that can handle such loads, and we ended up using ELK for the job.
Our set up is as follows:
web servers -> redis -> logstash -> elasticsearch
And we use Kibana (at least for now) to search the data.
As we have limited resources (disk space started to run out really fast), we decided to aggregate the hourly data using transforms. Each hour we started a new continuous transform that aggregated the data into hourly indices again. After the transform had done its job (aggregated the whole hour's data) we would stop it.
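For context, the per-hour cycle looks roughly like this. This is only a sketch: it assumes Elasticsearch 7.5+, where the endpoint is `_transform` (on earlier 7.x it was `_data_frame/transforms`), and the index names, sync field, and pivot body are illustrative placeholders, not our real config:

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: adjust to your cluster

def _call(method, path, body=None):
    """Minimal helper around the Elasticsearch REST API."""
    req = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode() if body is not None else None,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def transform_path(hour):
    """Per-hour transform id, e.g. /_transform/requests-agg-2019-12-11-22."""
    return f"/_transform/requests-agg-{hour}"

def run_hourly_cycle(hour):
    # 1. Create a continuous transform that pivots the raw hourly index
    #    into its aggregated counterpart (pivot body is a placeholder).
    _call("PUT", transform_path(hour), {
        "source": {"index": f"node-requests-{hour}"},
        "dest": {"index": f"node-requests-agg-{hour}"},
        "sync": {"time": {"field": "@timestamp", "delay": "60s"}},
        "pivot": {
            "group_by": {"url": {"terms": {"field": "url.keyword"}}},
            "aggregations": {"requests": {"value_count": {"field": "_id"}}},
        },
    })
    # 2. Start it.
    _call("POST", transform_path(hour) + "/_start")
    # 3. Later, once the whole hour is aggregated, stop it.
    _call("POST", transform_path(hour) + "/_stop?wait_for_completion=true")
```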
This setup performed quite well for the first 3 weeks, but after that our transforms started to get slower and slower (we also had a spike in our traffic, which is relevant). Transforms that previously finished in about an hour started to take double the time, which led to multiple transforms running simultaneously and to an overall cluster slowdown. So we stopped the transforms (we also have other indices with more delicate data and we did not want to stop indexing them) and added some additional logstash machines and one additional node to the cluster.
My question is: could this slowdown in performance be the result of the much larger number of shards that we had accumulated after a few weeks of uptime? We have small shards, but our memory usage is relatively steady - cluster-wide it hovers around 50-60%.
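One way to see whether the shard count is actually creeping up is the real `GET /_cat/shards?format=json` API, which returns one row per shard copy. A small sketch (stdlib only; the cluster URL is an assumption, and the counting logic is kept separate so it can be checked offline):

```python
import json
import urllib.request
from collections import Counter

def fetch_cat_shards(es_url="http://localhost:9200"):
    """Fetch one row per shard copy from the _cat/shards API."""
    with urllib.request.urlopen(es_url + "/_cat/shards?format=json") as resp:
        return json.load(resp)

def shard_totals(cat_shards):
    """cat_shards: list of rows, each with at least an 'index' key.
    Returns (total shard copies, shard copies per index)."""
    per_index = Counter(row["index"] for row in cat_shards)
    return sum(per_index.values()), per_index
```

Running `shard_totals(fetch_cat_shards())` every few days would show whether the total grows in step with the slowdown.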
This is our current cluster setup:
3 master-eligible data nodes -> 8-core CPU, 32 GB RAM (16 GB for Elasticsearch, 16 GB for the OS), 640 GB hard drive
2 data nodes -> 8-core CPU, 32 GB RAM (16 GB for Elasticsearch, 16 GB for the OS), 640 GB hard drive
The nodes are hosted on https://www.linode.com and are using CentOS 7.
The version of Elasticsearch we run is
This is our 'daily sharding':
One index with daily data that is not very write-intensive, with around 916.8mb of data per day. For this one we have set up
A second index with daily data that is not very write-intensive, with around 7.1gb of data per day. For this one we have set up
24 hourly indices, very write-intensive, with 19.9gb of data per hour. For every index we have 1 replica shard. These are the indices we used to aggregate and then drop after 24 hours.
24 aggregated indices with 181.9mb of data per hour. As they are quite small we allocated 1 replica shard each, but they are also very write-intensive, as the transforms are constantly creating and updating documents.
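Some back-of-the-envelope arithmetic on the layout above, to put the shard-count question in numbers. Assumptions (hedged): every index has 1 primary + 1 replica (the `_cat` line near the top shows `1 1` for an hourly index; for the daily and aggregated indices this is a guess), raw hourly indices are dropped after 24 hours, and the aggregated and daily indices are kept indefinitely:

```python
COPIES = 2  # 1 primary + 1 replica per index (assumption, see above)

def shards_after(days):
    """Rough total of shard copies on the cluster after `days` of uptime."""
    rolling_hourly = 24 * COPIES       # raw hourly indices, dropped after 24h
    aggregated = 24 * COPIES * days    # hourly aggregate indices, kept
    daily = 2 * COPIES * days          # the two daily indices, kept
    return rolling_hourly + aggregated + daily
```

Under these assumptions, `shards_after(21)` gives 1140 shard copies after the 3 weeks, versus around 100 on day one - so the shard count does grow by roughly 52 copies per day, which seems worth ruling in or out as a cause.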
At the time of the slowdowns we were using around 512mb of swap on every node; swap is now switched off. How much of a performance impact would this have added, if any, given that the cluster did not run out of memory?
The maximum number of file descriptors is set to
We have also set up x-pack to monitor the cluster, using the same cluster to store the monitoring data - how much of an impact should this have overall?
Also, are there any recommendations on the size of a dedicated monitoring cluster?
Any pointers on how we can improve the sharding and performance of the cluster are appreciated.
Thank you in advance.