I'm interested in building a log-management Elasticsearch implementation with simplicity prioritized over availability. We can live with some downtime, but not data loss, and our upstream infrastructure will accumulate logs if Elasticsearch goes down.
We are looking at 3-6 nodes.
So what about running multiple nodes inside one host OS with lots of memory, and discrete disks/arrays dedicated to each node? The host has redundant power supplies and spare parts.
As far as disk goes: we could either give each node its own RAID 10 array and run no replicas, or use replicas and give each node a portion of the drives via multiple data paths. RAID 10 seems like it would be faster and less complicated. I guess the argument could be made that using replicas instead of RAID 10 would allow the cluster to keep working if a node failed. But how often will one node out of several, all running on the same host, fail on its own, and would that failure be long-lived or more likely a memory violation solved by restarting that node?
Not worried about confusing the nodes when it comes to administration. It seems, for this size cluster and our requirements, one host OS and one installation of Elasticsearch, run X times with different folders and parameters, is far less complicated.
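To make "run X times with different folders and parameters" concrete, here's a rough Python sketch of what I have in mind. The paths, ports, and node names are just placeholders, not a recommendation; the `-E` overrides are standard Elasticsearch setting overrides.

```python
# Hypothetical layout: one Elasticsearch install, N instances on one host,
# each with its own data/log directories and ports. All paths are placeholders.

ES_HOME = "/opt/elasticsearch"   # single shared installation
INSTANCES = 3                    # number of nodes to run on this host

def launch_command(i: int) -> str:
    """Build the command line for instance i using -E setting overrides."""
    return (
        f"{ES_HOME}/bin/elasticsearch "
        f"-Ecluster.name=logs "
        f"-Enode.name=node-{i} "
        f"-Epath.data=/data/node-{i} "       # one dedicated disk/array per node
        f"-Epath.logs=/var/log/es/node-{i} "
        f"-Ehttp.port={9200 + i} "
        f"-Etransport.port={9300 + i}"
    )

if __name__ == "__main__":
    for i in range(INSTANCES):
        print(launch_command(i))
```

Each instance would also need its own heap size (e.g. via `ES_JAVA_OPTS`) so the combined heaps still leave plenty of RAM for the filesystem cache.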
Well, I thought and thought about what you said and what it could mean, and I think I've linked it up with something that's been bugging me in the back of my mind.
I keep seeing guidance from folks on scaling Elasticsearch that links data-on-disk to RAM with a required ratio: anywhere from an ideal 1:1 all the way up to 1:48 or even 1:96 for log data. I'm glad to see the ratio increase for log data, but even 1:96 is tough to build for when you are thinking about 40,000 workstations each generating 100 MB a day (roughly 4 TB of new data per day, so six months is on the order of 700 TB). It really limits how long you can keep that data around. Dwell time before detection for attacks is often 6 months or more.
But if I understand what you are saying, I'm getting the idea that something I intuited about the "working set" principle is true: it's not strictly how much index data is on disk that determines memory needs, but how much of that data you are trying to query at one time. That would explain why log data, by rule of thumb, allows a higher disk/RAM ratio than, say, a search engine for intranet content.
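That's also how I picture it playing out in practice: if the indices are split by day, most queries only touch the last few days, so only that slice has to be warm. A rough sketch with the 8.x Python client, where the endpoint, the `filebeat-7.17.0` prefix, and the `recent_indices` helper are all placeholders of mine:

```python
from datetime import date, timedelta

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def recent_indices(prefix: str, days: int) -> str:
    """Comma-separated list of the daily indices covering the last `days` days."""
    today = date.today()
    names = [
        f"{prefix}-{(today - timedelta(days=d)).strftime('%Y.%m.%d')}"
        for d in range(days)
    ]
    return ",".join(names)

# Only the last 7 daily indices are searched, so only that slice of the data
# needs to be sitting in the filesystem cache for the query to come back fast.
resp = es.search(
    index=recent_indices("filebeat-7.17.0", 7),  # hypothetical index prefix
    ignore_unavailable=True,                     # skip days with no index yet
    query={"match": {"message": "error"}},
    size=100,
)
print(resp["hits"]["total"])
```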
So: if you create different indexes on a given interval, such as daily, as Beats does by default, can you stretch that 1:96 further if the following are true:
FWIW, taking the 1:96 guidance as though it were a strict rule, with 40,000 endpoints producing 50 MB per day, here's what I come up with in terms of nodes, RAM, and disk needed in order to store various amounts of history.
The RAM and node quantity grow linearly along with the disk space.
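For anyone who wants to check the arithmetic, here's roughly how I derived it. The 64 GB of RAM per data node and the specific retention periods are assumptions on my part, and I'm ignoring replicas and compression:

```python
ENDPOINTS = 40_000
MB_PER_ENDPOINT_PER_DAY = 50        # the estimate above
DISK_TO_RAM_RATIO = 96              # the 1:96 guidance, taken literally
RAM_PER_NODE_GB = 64                # assumption: 64 GB RAM per data node

daily_gb = ENDPOINTS * MB_PER_ENDPOINT_PER_DAY / 1024   # ~1,953 GB of new data per day

for days in (30, 90, 180, 365):     # sample retention periods
    disk_tb = daily_gb * days / 1024
    ram_gb = daily_gb * days / DISK_TO_RAM_RATIO
    nodes = -(-ram_gb // RAM_PER_NODE_GB)                # ceiling division
    print(f"{days:>3} days: {disk_tb:7.1f} TB disk, "
          f"{ram_gb:7.0f} GB RAM, {nodes:3.0f} nodes")
```

The exact numbers shift with replicas, compression, and how the RAM is split between heap and filesystem cache, but the linear relationship is the point.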
That's a big It Depends.
If you have a look at some of the other sizing questions, you will see the things you need to consider.
That's really a business cost that needs justification, it's not a technical issue.
Primarily, yes.
It Depends. I would suggest rolling out to a subset of the hosts and seeing what you can and can't do around storage capacity. Then you need to consider querying capacity; there's no point storing massive amounts of data and then having to wait minutes or hours for responses when you need access now.
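If it helps, once a pilot group is shipping logs, something like this (8.x Python client; the endpoint and index pattern are placeholders) will show how big the daily indices actually come out, which is a better basis for extrapolation than any rule of thumb:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Size of each daily index from the pilot rollout, in GB.
rows = es.cat.indices(
    index="filebeat-*",   # hypothetical index pattern for the pilot
    bytes="gb",
    h="index,docs.count,store.size",
    s="index",
    format="json",
)

for row in rows:
    print(f"{row['index']:40} {row['docs.count']:>12} docs {row['store.size']:>8} GB")
```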
Thanks. Back to the other aspect of my question: what are the downsides/disadvantages of running multiple instances in one host OS instead of VMs on the same single physical host with adequate hardware? Assume the OS file system can handle all the open files, abundant cache, etc. Is it just the possibility of human error, like confusing data directories?