Elasticsearch Topology Design

Hello Elasticsearch big-brain experts,

I have been testing out Elasticsearch as a log vaulting and auditing tool and SIEM for the last few months, and my company has authorized the purchase of hardware to build a cluster. I have gone through a lot of documentation and put together a proposal, but would like to pass my design to the community to see if I have any glaring flaws or misunderstandings. I have spec'd out 3 types of servers below:

Hot Storage:

  • 2x 8-core Intel 3.1GHz procs
  • 128GB ECC Registered Memory
  • 1x 128GB M.2 (for OS)
  • 16x 2.8GB hot-swap NVMe SSDs
  • 2x 10GBASE-T LAN on motherboard
  • 1x 2-port 1GBASE-T PCI card (for management)

Warm Storage:

  • 2x 8-core Intel 3.1GHz procs
  • 128GB ECC Registered Memory
  • 12x 4GB HDDs in RAID 5
  • 1x 128GB M.2 (for OS)
  • 2x 10GBASE-T LAN on motherboard
  • 2x 1GBASE-T PCI card (for management)

Auxiliary Servers:

  • 2x 8-core Intel 3.1GHz procs
  • 128GB ECC Registered Memory
  • 8x 250GB SSDs
  • 1x 128GB M.2 (for OS)
  • 2x 10GBASE-T LAN on motherboard
  • 2x 1GBASE-T PCI card (for management)

My intention is to purchase 4 hot storage servers, 4 warm storage servers, and 3 auxiliary servers. The hot and warm servers will make up the data nodes as well as the masters. I plan to give indices on hot storage 2 primary shards with 1 replica, allowing for two servers' worth of fault tolerance. After 60 or 90 days I plan to have indices move to the warm hosts, where they will go down to 1 shard with a replica.
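
For what it's worth, the way I understand the hot-to-warm move is usually done is to tag nodes with a custom attribute (commonly box_type: hot / box_type: warm in elasticsearch.yml) and flip an index-level allocation filter as the index ages; Curator can automate it, but the raw calls look roughly like the sketch below. The cluster URL and index name are placeholders, the box_type attribute is just a convention, and the shrink step assumes a 5.x+ cluster where the _shrink API exists.

```python
import requests

ES = "http://localhost:9200"     # placeholder; in practice the client node's address
INDEX = "logs-2017.01.15"        # hypothetical daily index

# 1. Relocate the index onto nodes started with node.attr.box_type: warm.
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index.routing.allocation.require.box_type": "warm"})

# 2. To drop from 2 primary shards to 1, mark the index read-only and shrink it
#    (shrink also requires a copy of every shard on a single node first).
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index.blocks.write": True})
requests.post(f"{ES}/{INDEX}/_shrink/{INDEX}-shrunk",
              json={"settings": {"index.number_of_shards": 1,
                                 "index.number_of_replicas": 1}})
```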

The auxiliary servers I plan to use as a client node (data=false, master=false) to speed up searches, a dedicated parser box for Logstash, Filebeat, etc., and a Kibana front end.
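
For reference, my understanding is that a client (coordinating-only) node is just an Elasticsearch node started with node.master: false and node.data: false, and everything else (Kibana, Logstash outputs, Filebeat if it ships direct) then points at that address. A quick sanity check from Python might look like the following; the host name is hypothetical and the node.role column is what 5.x-era releases print.

```python
import requests

CLIENT = "http://es-client-01:9200"   # hypothetical coordinating-only node

# A coordinating-only node shows "-" in the node.role column
# (m = master-eligible, d = data on 5.x-era clusters).
print(requests.get(f"{CLIENT}/_cat/nodes?v&h=name,node.role,heap.max").text)

# Searches and bulk indexing sent here get fanned out to the data nodes.
resp = requests.get(f"{CLIENT}/logs-*/_search", json={"size": 0})
print(resp.json()["hits"]["total"])
```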

All servers will be on the same 1GBASE-T switch, with 2 lines each configured as an 802.3ad LACP bond for data and cluster communication, and a separate 1G connection to another switch for management (ssh, scp, configuration management, etc.). I have forgone dedicated master nodes because I read that they aren't really required until you get over about a 10-node cluster, but I'm not married to this idea if it is wrong.

Please let me know if there is anything I have missed or if this configuration is substandard or ill-advised. I do want to know where my baby is ugly on this one, because I don't want to waste money or cycles fixing my configuration later.

Or just tell me that I did a great job and should move forward. My ego can always use a boost. :)

A few things:

Yeah, I mistyped. The storage is 16x 1.8TB for the hot nodes and 12x 4TB HDDs in RAID 5 for the warm nodes. This should give me around 73 TB for hot storage and 132 TB (I think) for warm storage ((4*12-4)*4 - (4*12-4)). Sorry about that.
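
Spelled out, the warm-tier arithmetic I'm using is just RAID 5 losing one drive's worth of capacity per node, with one node's worth held back on top of that:

```python
# Warm tier estimate from above: 12 x 4 TB drives per node in RAID 5,
# so one drive's worth is lost to parity on each node.
per_node_tb = 4 * 12 - 4                   # 44 TB usable per warm node
total_tb = per_node_tb * 4                 # 176 TB across the 4 warm nodes
with_headroom_tb = total_tb - per_node_tb  # 132 TB, keeping one node's worth in reserve
print(with_headroom_tb)
```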

I would like to know what the benefit would be of running multiple Docker containers per host instead of just assigning half the host's memory (64GB) to run one instance. That is what I have seen in the documentation so far. Just explaining my thought process, not trying to argue anything.

As far as the auxiliary nodes go, I don't really have a reference point for how many resources a client node is going to consume. Per my understanding, I will be pointing Kibana and any log sources at the client node to either start indexing or perform searches, so if anyone has guidelines on how big that needs to be relative to the data/master nodes, that would be great. I have some idea about the dedicated parser: I threw some NetFlow from a couple of core routers at a really beefy VM and it beat the heck out of the CPU and memory, so I would like that to stay on some pretty beefy hardware, because I'm going to be throwing a lot more at it than what I gave the test/dev/QA instance.

I am pretty sure the Kibana hardware is way more than is needed, but limiting the number of hardware configurations made it easier to build a menu, so to speak, of hardware needed for expansion if ops, dev, or infrastructure wanted to get in on the benefits ES can give them. I probably won't need to upgrade that for a long time, but if someone wants to start onboarding non-security data into the cluster, I have a price I can give them for upgrading the storage nodes (depending on their log retention period), creating dedicated master nodes, or adding client nodes behind a load balancer. It just gives me some standards I can hand out easily with quotes.

We recommend <32GB per heap, so multiple nodes per host makes for better resource usage.

I see. So breaking the 8 hosts' single 64GB-heap instances into 16 instances with <32GB heaps will provide better resource utilization. I'm going to look into that then.
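
For my own notes, the <32GB figure is about compressed object pointers; once nodes are up, something like this should confirm each heap is still under the threshold (a sketch against the nodes info API; the URL is a placeholder, and the compressed-oops field appears on 2.2+ releases):

```python
import requests

ES = "http://localhost:9200"   # placeholder cluster address

# The nodes info API reports each node's configured max heap and whether the
# JVM kept compressed ordinary object pointers -- the reason for the <32GB rule.
for node in requests.get(f"{ES}/_nodes/jvm").json()["nodes"].values():
    jvm = node["jvm"]
    heap_gb = jvm["mem"]["heap_max_in_bytes"] / 1024 ** 3
    oops = jvm.get("using_compressed_ordinary_object_pointers", "unknown")
    print(f"{node['name']}: heap {heap_gb:.1f} GB, compressed oops: {oops}")
```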