Memory Issues and Query Limitations

Optimizing my system step by step, starting from one master-slave node.
I noticed two problems: 1) queries would start to time out as my cluster grew, and 2) during ingestion, shards would grow larger than allowable memory and would hit a locked state when attempting to add new data, with the log files showing gc allocation errors and the like.

For problem 1, I was thinking I needed to create more nodes in order to spread the shards out. This would allow multiple machines to serve queries, so if one machine was struggling, we could double or triple the query machines to roughly halve query times, less the overhead of the nodes talking to each other. I have been traveling a good bit, but if I recall correctly, I can spin up another Elasticsearch instance, it will auto-join the cluster given the discovery information, and the cluster will start rebalancing itself. So as I notice a bottleneck, I can gradually add more machines to improve query times: "Oh, it's slow. OK, let's spin up another X nodes for search."
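For what it's worth, the "given information" part is just discovery config: a new node joins when it can reach existing nodes. A minimal `elasticsearch.yml` sketch (cluster name, node name, and host addresses here are placeholders, not from my setup):

```yaml
# elasticsearch.yml sketch for a node that should join an existing cluster.
# Modern Elasticsearch does not auto-discover over multicast; you point
# discovery.seed_hosts at nodes already in the cluster.
cluster.name: ingest-test            # must match the existing cluster
node.name: es-new-node
network.host: 0.0.0.0
discovery.seed_hosts: ["10.0.0.11", "10.0.0.12"]   # addresses of existing nodes
```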

My other issue is ingestion. How do I handle a dataset that I know is finite? I need to better distribute the data across more slaves as well. The allocation issue I keep hitting essentially creates a cyclical timeout with GC allocation. My test bed is limited on RAM, which seems to correlate with the issue. I think the rule is to give the JVM heap no more than half the machine's RAM, but I remember setting the Java heap to 4 GB or so, since my machines had 16 GB each. I am trying to think of a way to keep the shards under this amount so they don't create a lock issue and then a backlog in Logstash.
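As a sanity check on that half-the-RAM rule, here is a quick sketch (the helper name is mine; the 50% guideline and the ~32 GB compressed-oops ceiling are Elastic's usual heap guidance, the machine sizes are from my setup):

```python
# Rough JVM heap-sizing sanity check, not an official formula.
# Guidance: give the heap at most ~50% of physical RAM (the rest goes to
# the OS filesystem cache), and stay under ~32 GB so the JVM can keep
# using compressed object pointers.

def recommended_heap_gb(ram_gb: float) -> float:
    """Half of RAM, capped below the compressed-oops threshold."""
    return min(ram_gb / 2, 31.0)

print(recommended_heap_gb(16))  # 16 GB machine -> 8.0 GB heap
print(recommended_heap_gb(4))   # 4 GB Raspberry Pi -> 2.0 GB heap
```

So on the 16 GB machines, an 8 GB heap would have been closer to the guideline than the 4 GB I set.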

My original thought was to add more machines, which would spread the shards out. Given that I have 5 TB of files to ingest, I was thinking I would want around 50 shards spread across all my systems, plus redundancy built in, so nothing is destroyed if there is a crash: at least one full replica of every shard.
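Running the numbers on that (5 TB is my figure; the 10-50 GB-per-shard target is Elastic's usual rule of thumb, so treat it as an assumption), 50 shards may actually be too few:

```python
# Back-of-envelope shard sizing. 5 TB is the dataset size from the post;
# ~50 GB per shard is the upper end of the common Elasticsearch guideline.

TOTAL_TB = 5
TARGET_SHARD_GB = 50

total_gb = TOTAL_TB * 1024
primaries = -(-total_gb // TARGET_SHARD_GB)  # ceiling division

print(f"{total_gb} GB at {TARGET_SHARD_GB} GB/shard -> {primaries} primary shards")
print(f"with 1 replica of everything: {primaries * 2} shards total")
print(f"50 shards would be ~{total_gb // 50} GB each")  # well over the guideline
```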

I was originally using 4 Raspberry Pis to do this as a fun little test, but I also noticed HDD throttling, since the storage was spinning disk rather than SSD. A huge limiting factor was the transfer speed between the machines and the HDD, which looked like it was reading and writing nonstop.

I was originally hoping to get a bunch of this Dockerized, or to have k8s handle it and spin up new machines to balance itself, and then just point Logstash at it to process and pass along the ingestion.
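A minimal Docker Compose sketch of the idea, assuming the official image (service names, the version tag, and heap sizes are placeholders; security settings are omitted for brevity):

```yaml
# Two-node Elasticsearch cluster sketch for Docker Compose.
# Adjust ES_JAVA_OPTS to roughly half of each node's RAM.
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - node.name=es01
      - cluster.name=ingest-test
      - discovery.seed_hosts=es02
      - cluster.initial_master_nodes=es01,es02
      - ES_JAVA_OPTS=-Xms2g -Xmx2g
    volumes:
      - esdata01:/usr/share/elasticsearch/data   # dedicated volume per node
  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - node.name=es02
      - cluster.name=ingest-test
      - discovery.seed_hosts=es01
      - cluster.initial_master_nodes=es01,es02
      - ES_JAVA_OPTS=-Xms2g -Xmx2g
    volumes:
      - esdata02:/usr/share/elasticsearch/data
volumes:
  esdata01:
  esdata02:
```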

The issue I expect to run into is storage and where these would be installed. While I could keep all of it in memory, inevitably I would want to back everything up to disk. The purpose is that after ingestion, I could save the data files for reinitialization later; I figure it would be far more expeditious to restore the data than to reingest it. The last time I tried to run the full set, it was ingesting for weeks. If I can resolve the bottlenecks, though, the ingestion timeline might drop from weeks to something like 24 hours.
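The "save after ingestion, restore later" idea maps onto Elasticsearch's built-in snapshot/restore, which might save rolling my own. A sketch against a local node (repository and snapshot names are placeholders, and `path.repo` has to be whitelisted in `elasticsearch.yml` first):

```shell
# 1) Register a shared-filesystem snapshot repository
curl -X PUT "localhost:9200/_snapshot/my_backup" \
  -H 'Content-Type: application/json' \
  -d '{ "type": "fs", "settings": { "location": "/mnt/es_backups" } }'

# 2) Snapshot everything once ingestion finishes
curl -X PUT "localhost:9200/_snapshot/my_backup/post_ingest_1?wait_for_completion=true"

# 3) Later, restore instead of reingesting
curl -X POST "localhost:9200/_snapshot/my_backup/post_ingest_1/_restore"
```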

Does anyone have any dynamic k8s or Docker files I can run that will spin up and expand as needed? I wanted to set up something simple on Google Cloud this weekend as a test ingestion. Tests on Google will probably be expensive given the dataset size, but it would be fun to try. Note: my Elastic test bed is not large enough to ingest the full data, so I need to set up Kibana and Elasticsearch somewhere else, since I want a complete picture and not a random snapshot.

I guess that comes down to: How big is your cluster? How much data? How many indices and shards?

If by memory you mean disk, yes. If you mean RAM, no.
If you are seeing errors, it'd be good to post them.

There's no concept of a slave node in Elasticsearch.

Correction. Let me address some of these things.

My base cluster was 4 maxed-out Raspberry Pi 4s plus 2 MacBook Pros running Docker containers connected to the cluster. The issue I was having was disk limitations on those machines. On the first point, if I recall the specifics, the timeouts were generally resolved by growing the cluster to distribute the queries across more nodes. So the issue was linear: as the ingested material increased, Kibana queries got slower, and as I added additional nodes they sped back up.

In some tests, I was clocking 750k entries per minute for 2 weeks before shutting it down due to HDD thrashing. One of my attempts had 4 containers all writing to a single HDD for storage, so there was a lot of thrashing from each container adding and deleting data. My WD HDD pretty much ran at 100% for 2 weeks, and I realized I needed a dedicated HDD per container to store the indices instead of one lumped location.
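For scale, that sustained rate works out to a pretty large document count (rate and duration are the figures from above):

```python
# Back-of-envelope volume at the quoted ingest rate.
rate_per_min = 750_000
minutes = 60 * 24 * 14          # two weeks

total_docs = rate_per_min * minutes
print(f"{total_docs:,} documents over two weeks")
```

That is on the order of 15 billion entries funneled through one spinning disk, so the 100% utilization is not surprising.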

I think I need to recheck the fundamentals, since it is proven that with a subset it works just fine, and I need to scale it out appropriately. Since I was just wiring stuff together without understanding the limitations, I need to break things into more manageable chunks. Conceptually, each machine needs dedicated storage: I want each container running on my desktop to have its own independent flash or HDD with optimal throughput, to limit thrashing and improve read/write. Then, as I need to expand and add new containers, I can set each one up accordingly.
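The one-disk-per-container idea can be expressed directly as bind mounts, one physical mount point per node's data path (the `/mnt/disk*` paths are placeholders for wherever each dedicated disk is mounted):

```yaml
# Fragment: pin each node's data directory to its own physical disk,
# so containers stop competing for the same spindle.
services:
  es01:
    volumes:
      - /mnt/disk1/esdata:/usr/share/elasticsearch/data
  es02:
    volumes:
      - /mnt/disk2/esdata:/usr/share/elasticsearch/data
```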
