I'm working with a dataset where a single day's worth of data fits and performs decently on a single EC2 instance.
This got me thinking: there may be benefits to having a single cluster per day. If I want two days of capacity, that's two smallish clusters with four total nodes (one replica machine for each primary machine), and writing CloudFormation templates or similar to build them out on a regular basis is not infeasible. 100 days would be 200 nodes, and so on.
From a performance perspective, hard partitions are potentially pretty interesting for my use case: streaming data back from multiple clusters in parallel could yield real performance boosts.
If I want to increase performance on any particular day, I can add nodes to that day's small cluster. It's also extremely easy to roll data off by terminating the machines or archiving the cluster images.
I know I can use tribe nodes to connect all of the clusters and work from a single endpoint. Dynamic configuration of the tribe node has me worried, though: it looks as though a tribe node's constituent clusters are defined in a yml file read at startup. Would I have to rebuild that yml file and restart my tribe node every day? At that point it may make more sense for me to have a service that publishes available ES clusters for consumer applications to use intelligently. Also, when you do a read operation against a tribe node, can you simultaneously query index_1 on cluster_1 and index_2 on cluster_2? Does the tribe node then condense the results from each cluster/index into a single aggregate result?
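For reference, my understanding is that the tribe node's constituent clusters are declared statically in its elasticsearch.yml, roughly like this (cluster names here are just examples of a per-day naming scheme, not anything official):

```yaml
# elasticsearch.yml on the tribe node -- one entry per constituent cluster.
# Adding a new day's cluster appears to mean editing this file and restarting.
tribe:
  day_1:
    cluster.name: es-day-2016-01-01
  day_2:
    cluster.name: es-day-2016-01-02
```

That static-at-startup shape is exactly what makes the daily-restart question above matter.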
Beyond the lifts of:
- Managing cluster creation
- Managing routing data to new clusters properly
- Managing tribe node constituent clusters and/or developing a client that can auto discover new clusters
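On that last bullet, the client I have in mind is essentially a thin router that maps a date range to the endpoints of the per-day clusters covering it. A minimal sketch, assuming a registry that the provisioning service would publish (all names and URLs here are hypothetical):

```python
from datetime import date

# Hypothetical registry of per-day clusters. In practice this would be
# fetched from the cluster-publishing service, not hard-coded.
CLUSTERS = {
    date(2016, 1, 1): "http://es-day-2016-01-01.internal:9200",
    date(2016, 1, 2): "http://es-day-2016-01-02.internal:9200",
}

def endpoints_for(start, end):
    """Return the endpoints of the day-clusters covering [start, end]."""
    return [url for day, url in sorted(CLUSTERS.items())
            if start <= day <= end]
```

A consumer would then fan its query out to `endpoints_for(start, end)` and merge the results itself, which is the work the tribe node would otherwise be doing.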
Am I missing anything?
Are there compelling reasons to go with a single large cluster or with a swarm of smaller clusters?