I have an ELK setup in my org. We're working on improving the reliability of the system. We have 22 data nodes running in the cluster
Adding co-ordinator nodes is something that was suggested to us, and we want to make sure they would actually benefit us. How do we make this decision? Is there a set of metrics we can base it on?
I'm also fairly new to the ELK stack. I have read about why co-ordinator nodes are used and how they can help, but I'm still unsure whether they are the right fit for us.
Is there a specific metric with which I can track how many requests each data node in my current cluster receives for data that is not in the shards it holds?
This implies some current/recent unreliability, right?
Can you describe in what way the current system is unreliable? Does something crash, if so what specifically, and how often? Do you know/understand why?
Or does it just get (too) slow? If so, what specifically gets slow?
Helpful would be, in very general terms, what does the cluster do? Was it always 22 nodes, or is it growing? Do you have some documentation on why you have a 22-node cluster, and why not a 20 or 25 or 50 or whatever node cluster - I mean where does that size come from, e.g. some calculation based on assumptions?
The specific co-ordinator node question IMO jumps a little bit ahead, let's see if we can better understand the problem you need fixed.
Yes. It has been happening from time to time. But since it was not just a one-off and keeps happening, we decided it's time to improve it.
ES goes to yellow status or red status sometimes
Users get 5xx errors when accessing Kibana because of this, and it has been a pain
This does not always happen when ES is yellow or red
Sometimes, even when ES is yellow or red, it never becomes an issue for the end users
The issue is mostly with ES, but what exactly in ES, we don't really know. We started digging a bit. Some of the causes we identified for the past 2 incidents were:
A memory spike on the ES nodes (but not in the JVM heap), due to which some ES nodes became unavailable
ES being overwhelmed by a high volume of logs and starting to rate-limit writes from Logstash, which directly affected users trying to query in Kibana
Queries in Kibana get slow, time out, or throw 5xx errors.
ELK is our central system that ingests request logs for the past 90 days from all prod systems running in AWS
No, it is growing with the size and volume of our logs. It has grown to 22 nodes over the years.
Currently, the ES cluster holds around 27 TB of data and might go up to 30 TB by the end of the month
Let me know if these details help answer my query and identify the problem. Will be happy to share more details if required. Thanks for your response @RainTown
Since I asked you to park the original question, let's go back to that first:
So, the answer from me is "I don't know". As of now, I don't think it's your main issue.
Maybe good news, maybe bad, but I think yours is the most interesting thread on the forum right now. So I hope others will weigh in. You are likely going to have to share more information too. I'm going to ask a few things:
Are you using data streams, and if so what's the index rollover policy?
How many indices are being heavily written to simultaneously, i.e. how are your AWS logs being sliced and diced into indices?
How many shards and replicas per (important) index?
Is your log flow reasonably constant, e.g. per hour/day, or does it have significant spikes, and if the latter do your cluster issues correlate with the log volume spikes?
Someone will always ask "what version?", so please share that too.
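If it's quicker, most of those answers can be read straight off the cluster with the _cat APIs; a rough sketch (the "logs-*" pattern is just a placeholder for whatever your index naming is):

```sh
# version and cluster name
curl -s 'localhost:9200/'
# indices matching the pattern, doc counts and sizes, biggest first
curl -s 'localhost:9200/_cat/indices/logs-*?v&s=store.size:desc'
# shards and replicas per index
curl -s 'localhost:9200/_cat/shards/logs-*?v'
# index templates, which show the shard/replica defaults being applied
curl -s 'localhost:9200/_template?pretty'
```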
If you have both a stability and a performance issue, then (wherever I've worked) stability is wayyyy more important. That nodes are falling out of the cluster even semi-regularly would be my first concern. But I'm old school.
So that likely means some shards are unallocated for a time, but those shards aren't the ones your users are querying most at that specific time. To be honest, there is likely a lot of luck involved there.
So, e.g. a pretty bad design here would be to have all the logs going into a single data stream, so that only one index at a time is being written to, with say 5 shards and 1 replica, even if that specific index rolls over pretty quickly and over a day there are lots of backing indices.
btw I've also seen clusters fall over just as a result of a terrible Kibana query, someone setting the timespan to "a year" for a view that is really intended to work OK on "last 15 mins". Those are (IMO) particularly hard to find, though that was in 4.x/5.x/6.x times, so maybe there are better controls/protections now.
From what you described I don't think coordinating nodes would solve anything; it looks like your cluster is undersized for the ingestion and query rate you have now.
How are your nodes distributed? Do you have dedicated masters and data tiering (hot-warm)?
What are the specs of your nodes and which version are you using?
I managed a similarly sized cluster with double the disk usage, without any issues, but I had dedicated masters and hot-warm tiering, with all client communication (ingestion and Kibana) going only to the hot nodes since they had more CPU and memory.
Along with what others have asked, what are the functions of your 22 nodes?
Any/number of: dedicated masters, hot, warm, cold, frozen? Where does Kibana run? What version of Elastic? Any frozen searchable snapshots? How old is this cluster?
Clusters evolve. We had basically a 2-location on-prem cluster: 6 hot nodes, 3 at each location, 1 of which at each location was a master. There was a 3rd "voting only" master at a 3rd location. There was a big-disk cold node at each location and a co-ordinating node with Kibana at each location.
Each hot node ran Logstash and ingested to its "local" Elasticsearch. All indices had 1 replica, so there was always a primary at one location and its replica at the other. We had a lot of different indices; it was rare for any to have more than 1 shard, but with so many different indices, ingest still went to many shards.
Each Kibana connected to the co-ordinating node that was "local" to it. These were behind the "pretty DNS name" for Kibana, API access and some trivial ingest. The Beats we managed all sent to our 6 Logstash instances in round-robin. This cluster predated the Elastic Agent, and Logstash pipelines were more popular than ingest pipelines.
Sorry for the long story, but why co-ordinating nodes? Queries (Kibana and API) used the co-ordinating nodes' heap to aggregate instead of the data nodes'. We would have cases where bad Kibana queries would cause heap problems. The rest of the story is: don't let people see "Advanced Settings" in Kibana, and definitely don't let them change the max buckets setting. Even if they are your boss...
No. I think data streams are only supported from ES 7.9 onwards.
Our major indices are monthly. Since we retain logs for at least 90 days, we have 3 indices, one for each of the past 3 months, plus one more index that is ingesting logs for the current month.
What I mentioned above is for one such service in our org. We have 5 major services, so we have 5 * 4 = 20 major indices, of which writes only go to the 5 current-month indices.
Yes, it peaks during the working hours of the day and drops down at night
We only store the past 3 days of monitoring data. So it is hard for me to comment on this. In the case of an incident next time, I can try comparing the EVENTS-IN metric with other days
I think this could still happen to us since we're still on 6.x. But how do I verify whether this is the case? For us the max window could be around 120 days.
Before I forget, I think you should keep the monitoring data longer than 3 days, at least for now. As you maybe start changing stuff, you need decent before and after periods to compare to make an informed judgement on "did changing that help, and if so how much?".
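If (and only if) that 3-day window comes from the default local monitoring exporter rather than, say, a curator job deleting the .monitoring-* indices, then the retention is a single dynamic setting; a sketch, assuming that is indeed your setup:

```sh
# assumes the built-in local monitoring exporter; adjust or skip if retention is enforced elsewhere
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "xpack.monitoring.history.duration": "30d"
  }
}'
```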
By my maths, if the 8 billion docs from a screenshot above cover approx 90 days and consume the 27 TB, your cluster is averaging only around 1100 docs/second over the whole period, and the average doc size is not huge either at ca. 3.5 KB, or 1.75 KB (I can't recall if the number shown by Kibana 6 is just the primary storage, or includes the replicas too).
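For transparency, the back-of-envelope arithmetic (the 8 billion docs / 90 days / 27 TB figures are taken from the screenshot as read):

```sh
# rough averages; integer bash arithmetic is good enough here
echo "avg docs/sec : $(( 8000000000 / (90 * 86400) ))"       # ~1028 docs/s
echo "avg bytes/doc: $(( 27000000000000 / 8000000000 ))"     # ~3375 bytes, roughly half that if replicas are included
```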
At that rate, your cluster should very easily cope. OK, you said your load increases significantly during the day, but even if it were 10x the average I would still think your cluster should cope. Though we don't yet know the specs of each node.
I've still got a strange hunch that unhelpful user activity is a factor here, because I've been bitten by that in the past. And to my recollection Kibana 6 did not have a lot of tools to help check/validate/track this.
I'd look at the Kibana 6 equivalent of 22 (or even 25) screenshots like the one below, taken from the latest Kibana, for a whole, typical working day. In a well-balanced cluster the 22 data nodes should all look broadly similar.
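If clicking through 22 node screenshots in Kibana 6 is too painful, the same comparison can be roughed out from the command line; a sketch (run against any node):

```sh
# one line per node, so outliers among the 22 data nodes stand out quickly
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,ram.percent,cpu,load_1m&s=name'
# shard count and disk usage per node
curl -s 'localhost:9200/_cat/allocation?v&s=node'
```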
Also, I'm a bit of a symmetry lover, but 5 important indices with 5 primary + 5 replica shards each is 50 shards being heavily written to simultaneously across 22 data nodes. At best, you have some nodes with 2 busy shards and some with 3. But you can fairly easily get an even more imbalanced write load across the nodes. Then again, 2 or 3 or even more heavily-writing shards per node should not really create an issue unless the nodes (or IO) are really underpowered, or overwhelmed by expensive searches/visualisations, or something else we don't know about.
What is the hardware specification of the nodes in the cluster? Which instance type(s) are you using? What size and type of storage do the nodes have? Do all data nodes have the same specification?
27 TB across 640 shards gives an average of 42 GB if I calculate correctly. Is that the average shard size for the important indices, or do the 640 shards include a lot of small ones that distort the number? What is the maximum shard size?
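Something like this (against any node) would show the largest shards directly, rather than working from averages:

```sh
# biggest shard copies first; "store" is the on-disk size of each shard copy
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store,node&s=store:desc' | head -20
```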
We thought about it, but we'll have to estimate how much of a cost increase that would be, depending on how long we plan to store it.
But for now, when there's an incident, is there any specific metric that I should look out for and try to make sense of?
I don't think this average would be a fair measure. Yesterday, the indexing rate peaked around 25,000 docs per second across primary and replica shards, which could be an exceptional case as well, since it also led to degradation for the end users. Will mention more about this in the next thread.
For us the log size or the document size also varies based on the index.
At peak (at least with yesterday's data), it is 25x the average though.
You're right, the average wouldn't make sense here. We also have the smallest shards in the KB range (dead-letter-queue indices). The average shard size per index for the top 6 indices would be:
I rarely use c5 instances for data nodes, as a higher RAM-to-CPU ratio is preferable given that Elasticsearch is IMHO rarely limited by CPU. Before making any changes, make sure you check your monitoring data.
These nodes have CPU credits so can be throttled during periods of high workload. I would recommend verifying you are not experiencing any CPU throttling. These instance types are often used for dedicated master nodes so I would expect there to be no problems.
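One possible way to verify that, assuming you can query CloudWatch (the instance id and time range below are placeholders):

```sh
# CPUCreditBalance trending towards zero (or, with Unlimited mode, a growing surplus-credit charge)
# during the problem windows would indicate heavy bursting on these nodes
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2024-05-01T00:00:00Z --end-time 2024-05-02T00:00:00Z \
  --period 300 --statistics Average
```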
Even though it is better than gp2 EBS it is still not very fast storage and I have on numerous occasions seen it be the main performance bottleneck. I would recommend you check disk utilisation and await, especially during periods where you see performance issues. Poor storage performance is one of the most common performance bottlenecks.
The general recommendation is to keep shard sizes between 30 GB and 50 GB, but that does vary depending on the use case (it can be lower, but rarely higher in my experience). I would recommend increasing the number of primary shards for the larger indices, or changing your rollover settings if you are using rollover.
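As a sketch of what that could look like on the 6.x API (template name, pattern and shard count are placeholders to adapt; only indices created after the change pick it up, since shard count is fixed at index creation):

```sh
curl -s -XPUT 'localhost:9200/_template/service-a-logs' -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["service-a-logs-*"],
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 1
  }
}'
```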
This is possibly one area where the old 6.x version hinders you. My own recollection (from 4.x/5.x/6.x) was that access control was not as fine-grained as it is now. But we did not even have SSO enabled, people just shared accounts. So (e.g.) a whole department, dozens of people, had effectively full or close-to-full access to Kibana, meaning they could essentially query anything, which was exactly our situation. The tooling then did not (and probably could not) help. We did find one guy had made a dashboard with a terms aggregation on _id, obviously not through any malice, he just didn't realize it was a daft idea.
In terms of access control, my impression is things are significantly more fine-grained now. As to within Kibana or Elasticsearch itself, I don't know; one presumes these things just get better as the software matures.
So, this is the key period you need to study hardest. You had, per index, 5 shards and 1 replica. So if those 25k docs/second are dominated by (e.g.) a single index, then you have (at best) an effective 10-node cluster. This you can see in the monitoring tabs. I'd take a really close look at the monitoring data for all 22 data nodes, and the 3 masters, for typical working days, and around those peak indexing periods. What you are looking for is how homogeneous the load is across the cluster, to what extent some nodes deviate from others, and how many nodes do so.
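One cheap cross-check alongside the monitoring screenshots, run during that 25k docs/sec window; a sketch:

```sh
# non-zero "rejected" or a persistently full "queue" on only a subset of nodes is a hot-spot signal;
# look at the indexing pool (named "bulk" before 6.3, "write" afterwards) and the "search" pool
curl -s 'localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=node_name'
```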
As @Christian_Dahlqvist also said, it would be useful to have something like iostat running continuously on the nodes to check for IO bottlenecks. IO slowness creates back-pressure, and that can have a kind of snowball effect.
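For example (needs the sysstat package on the data nodes):

```sh
# extended device stats every 30 seconds; watch %util and await on the EBS volume during the peak window
iostat -x 30
```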
I think you also noted you have monthly indices, and no data streams because it's the old version. So once your index gets created, shard P1,2,3,4,5 goes here and shard R1,2,3,4,5 goes there, and that'll stay that way for the whole month unless Elasticsearch decides to shuffle things around, which it will probably only do when your cluster, say, loses a node temporarily. This has the potential to create hot spots.
You might also wish to check if the shards are just sticking to the same nodes; I expect they would be. And the symmetry issue I described above: your 5 most-dynamic indices have 25P + 25R shards - how are they actually distributed? It could even be that several shards (primaries and replicas) of your most heavily written index land on the same node.
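A sketch of how to eyeball that without a spreadsheet (the index pattern is a placeholder for whatever your current monthly indices are called):

```sh
# shard placement for the current hot indices; shows whether several busy shards of the
# same index pile up on the same node
curl -s 'localhost:9200/_cat/shards/*-2024.05?v&h=index,shard,prirep,state,node&s=index,shard'
```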
One last point: there is a tool called cerebro which has a web interface and gives (IMO) a nicer view into how the indices and shards are spread across the cluster. It is no longer developed and does not support 8.x, but it can certainly work with 6.x.
Yeah, I noticed this as well. Moving to an m-series instance type is one of the action items we can work on, but it might not be the lowest-hanging fruit. The peak average CPU utilization across all the data nodes for the past 4 weeks is just 25%.
The t3a nodes have Unlimited mode enabled for bursting. So we should be fine
I see. Adding a Kibana upgrade to my list of potential action items to improve the stability of the system. Is there any way I could see how much impact bad queries currently have on my system?
Sure. Thanks for this. Will be monitoring the following for the next few days - overall, during peak hours and during degradations
Indexing rate per index
System metrics like load, CPU utilization, heap memory
EBS volume metrics
Would the load average in the Monitoring tab not include I/O as well, or is it just CPU?
Sorry, didn't get this part. What is "here" and "there" here?
I'll have to check this. I'll confirm in some time.
I checked this. All primary shards and replica shards are on separate nodes for all 5 main indices.
Thanks. Currently using the ES API to get the shard details and filtering them in Google Sheets.