Hi everyone, I'm looking for some input, feedback and guidance on how I use Elastic, both from a hardware point of view and in how I'm utilising Docker.
First, everything below has grown from self-interest. I don't operate a business on the back of any of this; it's more of a hobby that has grown and got a bit out of control. I have a passion for scraping data and seeing what you can do with large amounts of it, and Elastic has played a big part in my journey of self-progression and learning.
OK, so I have 2 machines of identical spec, each with the following:
i5 6-core CPU
64GB RAM
5x 4TB Samsung SSD
512GB M.2
Using Docker I have a 4-node cluster. Each node has 8GB RAM and 1x 4TB SSD, and I run with 1 primary and 1 replica. My thought process here is that if one drive failed for whatever reason, it would only be like that node dying, and the cluster could rebuild using the primary and replica shards from the other nodes.
When I first started with this approach I only had 2x 4TB SSD and 32GB of RAM. As I upgraded my system, wanting either more memory or more storage, I could add another disk, assign it to a new node in my Docker Compose file and bring that node online. That let me grow from a 2-node cluster to a 4-node cluster with little hassle. As of today, if I wanted to increase my storage, I could just add another 4TB SSD and bring a 5th node online.
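To put rough numbers on the layout above, here is a minimal sketch of the capacity arithmetic. It assumes every index keeps 1 replica and it ignores filesystem overhead and Elasticsearch's disk watermarks; the function name is only for illustration.

```python
def usable_primary_tb(nodes: int, tb_per_node: float, replicas: int = 1) -> float:
    """Space left for primary data when every shard is stored (1 + replicas) times."""
    raw_tb = nodes * tb_per_node
    return raw_tb / (1 + replicas)


# Current layout: 4 nodes, one 4TB SSD each, 1 replica.
print(usable_primary_tb(nodes=4, tb_per_node=4))  # 8.0
# After adding a 5th node with another 4TB SSD.
print(usable_primary_tb(nodes=5, tb_per_node=4))  # 10.0
```

So each extra 4TB disk adds roughly 2TB of room for primary data while the replica count stays at 1.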
So I guess I am just using Docker for ease, and to replicate what would otherwise be multiple physical hosts: the nodes in my cluster are represented virtually using containers.
As time has gone on, I purchased the same hardware to build a second machine and run it in the same way. One is a dev environment of sorts and the other a production environment, when in reality I just store different datasets on each, but then I find myself with the pain of moving a large index from one cluster to the other.
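For reference on the index-moving pain: Elasticsearch's `_reindex` API can pull an index from a remote cluster. Below is a minimal sketch of the request body only. The host and index names are hypothetical, and it assumes the destination cluster lists the source host in its `reindex.remote.whitelist` setting.

```python
import json


def remote_reindex_body(source_host: str, source_index: str, dest_index: str) -> dict:
    """Build the JSON body for POST /_reindex, run against the destination cluster."""
    return {
        "source": {
            # The remote host needs scheme, host and port.
            "remote": {"host": source_host},
            "index": source_index,
        },
        "dest": {"index": dest_index},
    }


# Hypothetical names: pull "scraped-docs" from the dev machine into this cluster.
body = remote_reindex_body("http://dev-box:9200", "scraped-docs", "scraped-docs")
print(json.dumps(body, indent=2))
```

The body would be sent as a POST to `/_reindex` on the destination cluster. For an index with billions of documents, adding `?wait_for_completion=false` returns a task ID instead of holding the connection open.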
What I'm trying to understand is what leveraging Docker in this way with Elastic actually costs me. I have thought about starting from scratch and treating each machine as a single node, giving me 2 nodes, each with all the storage on that machine striped as RAID 0.
If one machine were to blow up, or a disk on that host failed, then at least a copy of all that data would be on the other host and the cluster could rebuild. If I wanted to introduce more storage or resources, I could build another machine and add that to the cluster/network. Or perhaps I could add another drive to the existing host? (Pros? Cons?)
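One way to compare the two layouts is by how much data a host loses when a single disk dies. A rough sketch, assuming 4TB disks that are full and that a RAID 0 stripe is lost entirely when any one member fails (the function name is made up for illustration):

```python
def tb_lost_on_disk_failure(disks_per_host: int, tb_per_disk: float, raid0: bool) -> float:
    """Worst-case data a host loses when one of its disks dies."""
    if raid0:
        # RAID 0 has no redundancy: one dead disk takes the whole stripe with it.
        return disks_per_host * tb_per_disk
    # One Elasticsearch node per disk: only that node's data is lost.
    return tb_per_disk


print(tb_lost_on_disk_failure(4, 4, raid0=False))  # 4
print(tb_lost_on_disk_failure(4, 4, raid0=True))   # 16
```

In the striped layout, a single disk failure means the whole node has to be rebuilt from the other machine over the network, rather than just one disk's worth of shards.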
Would I get much better performance doing it properly rather than the Docker way? And if so, is it a meaningful and worthwhile difference?
All of the above has been on my mind for some time, and more so now that I am looking at upgrading my hardware without knowing what hardware will actually improve performance. Do I get more cores? Will going from an i5 to an i7 make much of a difference? Do I upgrade from 64GB to 128GB, and with what type of memory, etc.? I know more memory and faster storage can make the big differences, but that's one area where my knowledge is limited.
It's also worth noting that I am storing a LOT of data. I have a few indexes which range from 1 to 3 billion documents each, and in some use cases that number could double.
One index with almost 3 billion documents has 7 key-value pairs per document, whereas another index has anywhere up to 100+ key-value pairs, nested all over, across around 1.2 billion documents.
On these machines I also run other small projects within their own Docker containers, like web scrapers, which collect data that then gets written to Elastic and/or maybe MySQL. I have additional storage disks for this, with the intention that Elastic data is kept on different drives from MySQL data, so if one craps out it does not impact the other service.
When working with the data, I might do some enrichment processing, like extracting unique entities and rolling them up onto the documents by updating them. I sometimes do a fair bit of scan and scroll to export data, work on that, and then load it back in as a new index to experiment on.
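The "load it back in as a new index" step can be sketched as building a payload for the `_bulk` API, which takes newline-delimited JSON: one action line followed by one document line, with a trailing newline at the end. The index name and documents below are made up for illustration, and a real export of this size would be sent in batches rather than as one string.

```python
import json
from typing import Iterable


def to_bulk_ndjson(docs: Iterable[dict], index: str) -> str:
    """Turn exported documents into a _bulk request body targeting a new index."""
    lines = []
    for doc in docs:
        # Action line: index this document into the target index.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical enriched documents going into an experimental index.
sample = [{"url": "a.example", "entities": ["x"]}, {"url": "b.example", "entities": []}]
payload = to_bulk_ndjson(sample, index="experiment-v2")
print(payload)
```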