Elastic cluster slows down after a few weeks of uptime (cluster recommendations)

Hello everyone. I am new to the ELK world and this is my first post so I hope you will go easy on me.

I am currently working on a reporting system for one of our in-house projects. The problem is that we have lots and lots of data that we need to handle. To give you an example, this is one of our hourly indices:

green  open node-requests-2019-12-11-22                    asdasdASFasf3214234sdf 1 1 52593609       0  20.8gb  10.4gb

We have around 52 million records per hour, which is 1248 million records per day. So we needed something that could handle such loads, and we ended up using ELK for the job.

Our set up is as follows:

 web servers -> redis -> logstash -> elasticsearch

And we use Kibana (at least for now) to search the data.

As we have limited resources (the free disk space started to shrink really fast), we decided to aggregate the hourly data using transforms. Each hour we start a new continuous transform that aggregates the data into (again) hourly indices. After the transform has done its job (aggregated the whole hourly index), we stop it.
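To make this concrete, a continuous transform along these lines is roughly what I mean. This is only a sketch: the transform id, index names, and field names (@timestamp, endpoint, duration_ms) are made up for the example, and on 7.4 the API lives under _data_frame/transforms (it was renamed to _transform in later releases):

  # Create a continuous transform that pivots the raw hourly index into an aggregated index
  PUT _data_frame/transforms/requests-agg-2019-12-11-22
  {
    "source": { "index": "node-requests-2019-12-11-22" },
    "dest": { "index": "node-requests-agg-2019-12-11-22" },
    "frequency": "5m",
    "sync": { "time": { "field": "@timestamp", "delay": "120s" } },
    "pivot": {
      "group_by": {
        "endpoint": { "terms": { "field": "endpoint.keyword" } }
      },
      "aggregations": {
        "request_count": { "value_count": { "field": "endpoint.keyword" } },
        "avg_duration": { "avg": { "field": "duration_ms" } }
      }
    }
  }

  # Start it, then stop it once the hour has been fully aggregated
  POST _data_frame/transforms/requests-agg-2019-12-11-22/_start
  POST _data_frame/transforms/requests-agg-2019-12-11-22/_stop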

This setup performed quite well for the first 3 weeks, but after that our transforms started to get slower and slower (we also had a spike in our traffic, which is relevant too). Transforms that previously finished in about an hour started to take double the time, which led to multiple transforms running simultaneously and to an overall cluster slowdown. So we stopped the transforms (as we also have other indices that contain more delicate data and we did not want to stop indexing them) and added some additional Logstash machines and one additional node to the cluster.

My question is: could this slowdown in performance be the result of the much larger number of shards that we accumulated after a few weeks of uptime? We have small shards, but our memory usage is relatively steady - cluster-wide it hovers around 50-60%.
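For reference, these are roughly the requests I use to check the total shard count and per-node heap; the filter_path fields and columns are just the ones I happen to look at:

  # Total number of active shards in the cluster
  GET _cluster/health?filter_path=status,active_primary_shards,active_shards

  # Heap usage per node
  GET _cat/nodes?v&h=name,heap.percent,heap.max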

This is our current cluster setup:

3 master-eligible data nodes -> 8-core CPU, 32 GB RAM (16 GB for Elasticsearch, 16 GB for the OS), 640 GB hard drive
2 data nodes -> 8-core CPU, 32 GB RAM (16 GB for Elasticsearch, 16 GB for the OS), 640 GB hard drive

The nodes are hosted on https://www.linode.com and are running CentOS 7.
The Elasticsearch version is 7.4.2.

This is our 'daily sharding':

  1. One index with daily data that is not very write intensive, with around 1557584 documents and 916.8mb of data per day. For this one we have set up 3 primary shards and 1 replica.

  2. A second index with daily data that is not very write intensive, with around 14710852 documents and 7.1gb of data per day. For this one we have set up 4 primary shards and 1 replica.

  3. 24 hourly indices that are very write intensive, with 50297848 documents and 19.9gb of data per hour. For every index we have 12 primary shards and 1 replica. These are the indices we aggregate and then drop after 24 hours (a template sketch for these settings follows the list).

  4. 24 aggregated indices with 643414 documents and 181.9mb of data per hour. As they are quite small we allocated 1 primary shard and 1 replica, but they are also very write intensive, as the transforms are constantly creating and updating documents.
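To illustrate how shard counts like these can be applied per index pattern, here is a minimal sketch of a legacy index template; the template name and index pattern are made up for the example:

  # Every new hourly index matching the pattern gets 12 primaries and 1 replica
  PUT _template/node-requests-hourly
  {
    "index_patterns": ["node-requests-*"],
    "settings": {
      "number_of_shards": 12,
      "number_of_replicas": 1
    }
  }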

At the time of the slowdowns we were using around 512mb of swap memory on every node, which is now switched off. How much of a performance impact would this have added, if any, given that the cluster did not run out of memory?

The maximum number of file descriptors is set to 65536.

We also set up X-Pack monitoring for the cluster, storing the monitoring data in the same cluster. How much of an impact should this have overall?

Also any recommendations on the size of a monitoring cluster?

Any pointers on how we can improve the sharding and performance of the cluster are appreciated.

Thank you in advance.

Did you consider rollups for this?

Yes, swap can have a severe performance impact.

It sounds like you are creating a lot of undersized shards, and may see better performance with fewer larger shards instead. Here is an article about this:

Also 12 primaries seems like a lot for 2 data nodes. Why not 2 primaries?

Also you might like to consider ILM, or at least rollover, so that you can target a particular shard size instead of doing this purely by time.
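As a rough sketch of what that could look like (the policy name and thresholds are only examples, not a recommendation for your exact numbers):

  # Roll over once the current index reaches ~30gb or is 1 day old; delete rolled-over data after 2 days
  PUT _ilm/policy/node-requests-policy
  {
    "policy": {
      "phases": {
        "hot": {
          "actions": {
            "rollover": { "max_size": "30gb", "max_age": "1d" }
          }
        },
        "delete": {
          "min_age": "2d",
          "actions": { "delete": {} }
        }
      }
    }
  }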

In Kibana monitoring, under nodes, advanced tab, check the number and duration of GCs. As heap is used up, GCs become more frequent as the JVM tries to find free space.
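The same numbers are also available outside Kibana via the node stats API, if that's easier to script; the filter_path just trims the response down to the GC collectors:

  # Per-node GC collection counts and total collection times
  GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc.collectors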

In the OS, use something like atop to see how your disks are performing; my guess is that they are very busy :slight_smile:

My guess is that you need more data nodes. You may be over-allocating shards as well. With 1 replica on 2 nodes, you have a copy of everything everywhere. As long as the shards stay under 20-50GB, you don't need more than 1 primary per index.
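A quick way to see how big your shards actually are, largest first:

  # Shard sizes across the cluster, sorted by store size
  GET _cat/shards?v&h=index,shard,prirep,store,node&s=store:desc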

If the $'s are there, I'd start with adding 2 more data nodes. Follow a philosophy of "just in time": add resources as you need them. Doing is a much better sizing method than forecasting :slight_smile:

Yes, we have considered using rollups. There are two things that are currently somewhat of a blocker for our use case.

The first thing that bothers me is that rollup searches currently do not support the index-name-* syntax, and our application relies on this. Also, I do not fully understand how the rollup indices will play together with ILM. So let's assume that we have one rollup index that aggregates the data, and we also define an ILM rollover policy to roll the index over when it grows to, for example, 30GB, to keep it from growing indefinitely.

If I need statistics over the whole dataset (let's say we have already rolled over a few indices), how would I do it?
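For reference, my understanding is that rolled-up data is queried through the _rollup_search endpoint against specific indices rather than an index-name-* pattern; here is a sketch with made-up index and field names (the rollup job would need to have captured these metrics):

  # Aggregation-only search against a rollup index
  GET /rollup-node-requests/_rollup_search
  {
    "size": 0,
    "aggregations": {
      "total_requests": { "sum": { "field": "request_count" } }
    }
  }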

Thanks, I went over the article.

Sorry, I got you confused with my node description: we have 5 data nodes, 3 of which can become master. So at any given time it is basically 4 data nodes and 1 master/data node.

I assume good sharding in this case would be 5 primaries?

By the way, thanks for the quick replies :smiley:

Not sure if this is normal for the GC?

Also I am interested in how much of an impact oversharding has on the overall cluster performance. Can you point me to a good read on the topic?

This is the current cluster that we are using

*Sorry for the multiple replies.

The heap and GC stats look OK.

Your node stats show a max CPU of 100%. Short periods of 100% CPU may be OK, but if it's frequent, it's not OK.

On both data nodes, run "iostat 5 5" during a busy time. What is the highest value for %iowait?

My opinion on swap usage: if you have some swap space used but little to no swap in/out activity, you are OK. You can check that with "vmstat 5 10" and look for values in the si and so columns.


@rugenl thanks for the help.

I ran iostat a couple of times during the past few days and these are the highest values for iowait I got.

I did the same with the memory stats; swap was no longer used.

The activity seems to be writes to /dev/sda; is that where your Elasticsearch data lives?

Yes, this is the device where Elasticsearch stores the data.
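For completeness, this can also be confirmed from the cluster itself via the node filesystem stats:

  # Shows the data path and the device/mount each node writes to
  GET _nodes/stats/fs?filter_path=nodes.*.name,nodes.*.fs.data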

The steal and iowait numbers suggest to me that ES just doesn't have the CPU/IO resources to do its job.


After comparing all those statistics on all our data nodes, I found that one of them is experiencing a lot more steal time (around 30%, compared to around 6% on the other nodes).

What could be the reasons for one node working more than the others, besides bad sharding?

Also, I will contact our cloud provider; the issue could be an overworked physical machine.

If one of them has a much higher steal time than the others, it's not working more, it's working less. As the doc I linked to suggests, you could try moving it to a different physical host with less noisy neighbors.

If CPU is much higher on one node than the others, in my experience it's usually due to heavy segment merging brought on by frequent updates to one or more documents that happen to fall on shards on that node. A quick look at hot threads will tell you.
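Something along these lines should surface it; the parameters are just reasonable starting points to tweak:

  # Samples the hottest threads on every node; merge-heavy nodes show Lucene merge threads near the top
  GET /_nodes/hot_threads?threads=5&interval=500ms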

If you have most of your indices set to 1 replica and only 2 nodes, your nodes are probably doing nearly identical work.

Just so I understand your layout (3 master, 2 data nodes), post the output of "GET /_cat/nodes?v".

ip              heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.146.106           70          96  13    1.53    1.59     1.62 dilm      -      node-2
192.168.160.192           54          97  11    1.91    2.11     2.23 dil       -      data-node-1
192.168.202.137           53          94  16    1.59    1.89     2.09 dilm      *      node-3
192.168.213.112           44          97  18    1.71    1.89     2.00 dilm      -      node-1

We scaled down by removing one of the nodes due to its performance issues.

All your nodes are data nodes, you don't have dedicated master nodes or dedicated data nodes. That is the "d" in "dilm" and "dil".

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.