Optimizing my system step by step, starting from a single master-slave node.
I noticed two problems: 1) queries would start to time out due to the size of my cluster, and 2) during ingestion, shards would grow larger than allowable memory and hit a lock state when new data was added, with the log files showing GC allocation failures and the like.
For problem 1, I was thinking I needed to create more nodes in order to spread the shards out, so that multiple machines could serve queries. If one machine was having issues, we could double or triple the query machines to roughly halve query times, less the overhead of nodes talking to each other. I have been traveling a good bit, but if I recall correctly, I can spin up another Elasticsearch instance and, given the cluster details, it will auto-join and start rebalancing itself. So as I spot a bottleneck, I can slowly add more machines to improve query times: "Oh, it's slow. OK, let's spin up another X nodes for the cluster for search."
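If I remember right, the auto-join part mostly comes down to matching cluster settings on the new node. A minimal sketch of the new node's elasticsearch.yml, assuming Elasticsearch 7+ (the names and addresses here are hypothetical; older versions used `discovery.zen.ping.unicast.hosts` instead):

```yaml
# elasticsearch.yml on the node being added (hypothetical names/addresses)
cluster.name: my-cluster              # must match the existing cluster's name
node.name: data-node-2
network.host: 0.0.0.0
discovery.seed_hosts: ["10.0.0.1"]    # address of any node already in the cluster
```

Once it joins, the cluster should start relocating shards onto it on its own.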
My other issue is ingestion. How do I handle a dataset that I know is finite? I need to better distribute the data across more slave nodes, which would spread the load. The allocation issue I keep hitting essentially creates a cyclical timeout with GC allocation. My test bed is limited on RAM, which seems to correlate with the issue. I think the rule is to give the JVM no more than half the machine's RAM, but I remember setting the Java heap to 4 GB or something even though my machines had 16 GB each, so I likely had headroom to go up to 8 GB. I am trying to think of a way to keep the shards under this amount so they don't create a lock issue and then a backlog in Logstash.
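For reference, the heap rule I was half-remembering is: half of physical RAM, with min and max set equal so the heap never resizes. On a 16 GB machine that would look something like this (a sketch; the `jvm.options.d` override-file mechanism is from Elasticsearch 7+):

```
# config/jvm.options.d/heap.options sketch: half of 16 GB physical RAM,
# -Xms and -Xmx set equal so the JVM heap never grows or shrinks at runtime
-Xms8g
-Xmx8g
```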
My original thought was to add more machines, which would spread the shards out some. Given that I have 5 TB of files to ingest, I was thinking I would want around 50 shards spread across all my systems, plus redundancy built in: at least one full replica of every shard, so nothing is lost if there is a crash.
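A quick sanity check on the 50-shard idea. The ~50 GB per-shard ceiling used below is a common Elasticsearch rule of thumb rather than a hard limit, and on-disk index size will differ from raw file size, but as a rough sketch:

```python
import math

def shard_count(total_gb: float, max_shard_gb: float = 50.0, replicas: int = 1) -> dict:
    """Rough shard-count estimate for a finite dataset (rule-of-thumb sketch)."""
    primaries = math.ceil(total_gb / max_shard_gb)
    return {
        "primary_shards": primaries,
        # each primary gets `replicas` extra copies for redundancy
        "total_shards": primaries * (1 + replicas),
    }

print(shard_count(5 * 1024))  # 5 TB of source files, one full replica
```

With 5 TB of raw data this comes out to roughly 100+ primaries, so 50 shards would put each one around 100 GB, which is about double the rule-of-thumb ceiling.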
I was originally using four Raspberry Pis to do this as a fun little test experiment, but I also noticed HDD throttling, since the storage was spinning disk rather than SSD. A huge limiting factor was the transfer speed between machines and the HDDs, which looked like they were reading and writing nonstop.
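Back-of-envelope on why the disks dominate (the ~100 MB/s sustained-throughput figure is an assumption for a typical spinning disk; Pi-attached storage is usually slower):

```python
# Sequential read time for one pass over the 5 TB dataset on a single HDD.
TOTAL_MB = 5 * 1024 * 1024   # 5 TB expressed in MB
HDD_MB_PER_S = 100           # assumed sustained HDD throughput
hours = TOTAL_MB / HDD_MB_PER_S / 3600
print(f"{hours:.1f} hours")  # lower bound just to read the data once
```

That is about 14.6 hours just to read the raw files once on a single disk, before any indexing work, so a 24-hour target only makes sense if the reads are spread across several disks or machines.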
I was originally hoping to get a bunch of this dockerized, or run through k8s, so it could spin up new machines and balance itself as needed, and then just point Logstash at it to process and pass along the ingestion.
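A sketch of the Docker side for a small local cluster (the image tag, heap size, and volume names are assumptions to adjust; plain Compose won't autoscale, so for the "spin up new machines on its own" part something like the ECK operator on k8s would be the next step):

```yaml
# docker-compose.yml sketch: two-node Elasticsearch cluster
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - node.name=es01
      - cluster.name=ingest-test
      - discovery.seed_hosts=es02
      - cluster.initial_master_nodes=es01,es02
      - xpack.security.enabled=false       # fine for a throwaway test bed only
      - ES_JAVA_OPTS=-Xms8g -Xmx8g
    volumes:
      - esdata01:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - node.name=es02
      - cluster.name=ingest-test
      - discovery.seed_hosts=es01
      - cluster.initial_master_nodes=es01,es02
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms8g -Xmx8g
    volumes:
      - esdata02:/usr/share/elasticsearch/data
volumes:
  esdata01:
  esdata02:
```

Logstash would then just point its elasticsearch output at port 9200.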
The issue I expected to run into is storage and where these nodes would live. While I could keep all of it in memory, inevitably I would want to back everything up to disk. The purpose is that after ingestion, I could save the data files for reinitialization later; I figure it would be more expeditious to restore the data than to re-ingest it. The last time I tried to run the full set, it was ingesting for weeks. If I can resolve the bottlenecks, though, the ingestion timeline might drop to 24 hours instead of weeks.
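Restore-instead-of-reingest maps onto Elasticsearch's snapshot/restore API. A sketch of the three REST calls involved, laid out as data rather than sent anywhere (the repository name `backups` and the filesystem path are hypothetical):

```python
import json

# (method, path, body) triples for the snapshot/restore workflow.
register_repo = ("PUT", "/_snapshot/backups",
                 {"type": "fs", "settings": {"location": "/mnt/es_backups"}})
take_snapshot = ("PUT", "/_snapshot/backups/full_ingest?wait_for_completion=true", None)
restore       = ("POST", "/_snapshot/backups/full_ingest/_restore", None)

for method, path, body in (register_repo, take_snapshot, restore):
    print(method, path, json.dumps(body) if body else "")
```

The one catch with an `fs` repository is that the location has to be a path every node can reach (shared mount), so on Google Cloud a `gcs` repository type pointed at a bucket would be the more natural fit.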
Does anyone have any dynamic k8s or Docker files I can run that will spin up and expand as needed? I wanted to set something simple up on Google Cloud this weekend as a test ingestion. Testing on Google will probably be expensive given the dataset, but it would be fun to try. Note: my Elastic test bed is not large enough to ingest the data, so I need to set up Kibana and Elasticsearch somewhere else, since I want a complete picture and not a random snapshot.