Hello everyone. I am new to the ELK world and this is my first post so I hope you will go easy on me.
I am currently working on a reporting system for one of our in-house projects. The problem is that we have lots and lots of data that we need to handle. To give you an example, this is one of our hourly indices:
green open node-requests-2019-12-11-22 asdasdASFasf3214234sdf 1 1 52593609 0 20.8gb 10.4gb
We have around 52 million records per hour, which is roughly 1.25 billion records per day. So we needed something that could handle such loads, and we ended up using ELK for the job.
Our setup is as follows:
web servers -> redis -> logstash -> elasticsearch
We use Kibana (at least for now) to search the data.
As we have limited resources (free disk space started to shrink really fast), we decided to aggregate the hourly data using transforms. Each hour we started a new continuous transform that aggregated the data into hourly indices of its own. After the transform had done its job (aggregated the whole hour of data) we would stop it.
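Roughly, each hourly transform looked something like the sketch below (index names, the grouping field and the aggregations are simplified placeholders, not our real mapping; on 7.4 the transform endpoints still live under `_data_frame/transforms`, they were only renamed to `_transform` in 7.5):

```
PUT _data_frame/transforms/node-requests-agg-2019-12-11-22
{
  "source": { "index": "node-requests-2019-12-11-22" },
  "dest":   { "index": "node-requests-agg-2019-12-11-22" },
  "pivot": {
    "group_by": {
      "url": { "terms": { "field": "url.keyword" } }
    },
    "aggregations": {
      "requests":     { "value_count": { "field": "@timestamp" } },
      "avg_duration": { "avg": { "field": "duration_ms" } }
    }
  },
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  }
}

# start it at the beginning of the hour
POST _data_frame/transforms/node-requests-agg-2019-12-11-22/_start

# once GET _data_frame/transforms/node-requests-agg-2019-12-11-22/_stats
# shows it has processed the whole hour, stop it
POST _data_frame/transforms/node-requests-agg-2019-12-11-22/_stop
```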
This setup performed quite well for the first 3 weeks, but after that our transforms started to get slower and slower (we also had a spike in our traffic around that time, which is likely relevant). Transforms that previously finished in about an hour started to take double the time, which led to multiple transforms running simultaneously and to an overall cluster slowdown. So we stopped the transforms (as we also have other indices that contain more delicate data and we did not want to stop indexing them) and added some additional Logstash machines and one additional node to the cluster.
My question is: could this slowdown in performance be the result of the much larger number of shards we had accumulated after a few weeks of uptime? We have many small shards, but our memory usage is relatively steady; cluster-wide it hovers around 50-60%.
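If raw numbers would help with the diagnosis, here is roughly how I can pull the shard counts and heap usage:

```
# cluster status and total shard count
GET _cluster/health?filter_path=status,active_primary_shards,active_shards,unassigned_shards

# per-shard sizes, to show how small the individual shards are
GET _cat/shards?v&h=index,shard,prirep,store,node&s=index

# heap and RAM usage per node
GET _cat/nodes?v&h=name,node.role,heap.percent,ram.percent
```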
This is our current cluster setup:
3 master-eligible data nodes -> 8-core CPU, 32 GB RAM (16 GB for the Elasticsearch heap, 16 GB for the OS), 640 GB hard drive
2 data nodes -> 8-core CPU, 32 GB RAM (16 GB for the Elasticsearch heap, 16 GB for the OS), 640 GB hard drive
The nodes are hosted on https://www.linode.com and are using CentOS 7.
The Elasticsearch version is 7.4.2.
This is our 'daily sharding':
- One index with daily data that is not so write intensive, with around 1557584 documents and 916.8mb of data per day. For this one we have set up 3 primary and 1 replica shard.
- A second index with daily data that is not so write intensive, with around 14710852 documents and 7.1gb of data per day. For this one we have set up 4 primary and 1 replica shard.
- 24 hourly indices, very write intensive, with 50297848 documents and 19.9gb of data per hour. For every index we have 12 primary and 1 replica shards. These indices we used to aggregate and then drop after 24 hours.
- 24 aggregated indices with 643414 documents and 181.9mb of data per hour. As they are quite small we allocated 1 primary and 1 replica shard, but they are also very write intensive, as the transforms are constantly creating and updating documents.
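By my count the raw hourly indices alone keep 24 x (12 primary + 12 replica) = 576 shards live at any time, and the aggregated indices add another 48 per day on top of the daily ones. If lowering the primary count per hourly index is the right direction, I assume it would be done with an index template along these lines (template name and shard numbers are just an illustration; on 7.4.2 this is still the legacy `_template` API):

```
PUT _template/node-requests-hourly
{
  "index_patterns": ["node-requests-*"],
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1
  }
}
```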
At the time of the slowdowns we were using around 512mb of swap memory on every node; swap is now switched off. How much of a performance impact would this have added, if any, given that the cluster did not run out of memory?
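If it matters, the usual way to verify memory locking after disabling swap seems to be the check below (it reports false unless `bootstrap.memory_lock` is enabled):

```
GET _nodes?filter_path=**.mlockall
```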
The maximum number of file descriptors is set to 65536.
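That limit can be confirmed per node with:

```
GET _nodes/stats/process?filter_path=**.max_file_descriptors
```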
We also set up X-Pack monitoring for the cluster, using the same cluster to store the monitoring data. How much of an impact should this have overall?
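As far as I understand, self-monitoring only needs the collection setting below turned on (so the monitoring indices land on the production cluster itself), while shipping the data to a dedicated monitoring cluster would need an `http` exporter configured in `elasticsearch.yml`, which we have not done yet:

```
PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": true
  }
}
```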
Also, any recommendations on the size of a separate monitoring cluster?
Any pointers on how we can improve the sharding and performance of the cluster are appreciated.
Thank you in advance.