Hi,
I know this might be a bit of a vague question, but hopefully I can get some insight into the hardware I'll need.
I'm collecting NetFlow data and using Kibana to build visualizations. My dashboard has about 20 visualizations: bar charts, pie charts and tables. Most of them are sums of total traffic per port/application/IP, sums of total data usage per day/week/month, and sums of data usage per IP per day/week/month.
This is how long a request takes for a single pie chart summing total.bytes per IP/port:
Hits 13885344 Query time 3705ms Request time 4895ms
This is the same visualization but inside my dashboard.
Hits 13885180 Query time 14151ms Request time 21212ms
So far I've noticed that performance scales fairly linearly, e.g. 60 days of data takes about twice as long to load as 30 days of data.
Based on my sample data I think I could end up with ~3.000.000.000 documents. What kind of cluster would I need to be able to search through that data with somewhat acceptable performance?
Right now I'm running a single 15GB instance on Elastic Cloud, with weekly indices and 1 shard per index. I'd say performance is reasonable at the moment (around 25 seconds to fully load a dashboard). But if I need to search ~200 times the amount of data I have right now, what kind of cluster would I be looking at?
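To put a rough number on that, here is a back-of-the-envelope sketch in Python. It assumes the linear scaling I've observed so far keeps holding all the way up to the full data set, which may well not be true. The document counts and the 25 second load time are the figures from this post. The 120 second target is the upper end of the 1 to 2 minutes I'd like to reach.

```python
# Rough extrapolation, ASSUMING dashboard load time scales linearly
# with document count (only observed on small data so far).
current_docs = 13_885_344      # hits reported for the current data set
target_docs = 3_000_000_000    # estimated final document count
dashboard_seconds = 25         # current time to fully load the dashboard
target_seconds = 120           # upper end of the 1-2 minute goal

scale = target_docs / current_docs
projected_seconds = dashboard_seconds * scale
speedup_needed = projected_seconds / target_seconds

print(f"data grows roughly {scale:.0f}x")
print(f"projected dashboard load: roughly {projected_seconds / 60:.0f} minutes")
print(f"speedup needed to hit {target_seconds}s: roughly {speedup_needed:.0f}x")
```

So on the current single instance I'd be looking at well over an hour per dashboard load, and I'd need something in the range of a 45x speedup to get down to 2 minutes.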
My understanding is that more shards spread over multiple instances will increase performance because searches will run in parallel. How about more shards per index on the same instance? Will that increase performance as well? How linear is performance scaling when you add an instance to a cluster (e.g. will doubling the instances give ~2x the performance)?
I'm not sure how to check server utilization on Elastic Cloud. Before that I had everything running on an AWS instance (16GB, 4 cores), and as far as I could tell CPU utilization only spiked for a couple of seconds during a search.
The above example is a worst-case scenario where all the data would be displayed. A more realistic use case is one where the same number of documents will be searched, but only 1/100th of the data is needed for aggregations etc. (filtered based on the location of my devices).
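To make that realistic case concrete, this is roughly the shape of the request I have in mind, written as a Python dict that serializes to the Elasticsearch query DSL. It is only a sketch: `total.bytes` is my real field, but `device.location`, `source.ip`, `@timestamp` and the value `site-a` are placeholder names for illustration, not my actual mapping.

```python
import json

# Sketch of a filtered aggregation: restrict to one device location first,
# then sum total.bytes per IP. Field names other than total.bytes are
# hypothetical placeholders.
query_body = {
    "size": 0,  # only aggregation results are needed, no individual hits
    "query": {
        "bool": {
            "filter": [
                {"term": {"device.location": "site-a"}},
                {"range": {"@timestamp": {"gte": "now-30d/d"}}},
            ]
        }
    },
    "aggs": {
        "bytes_per_ip": {
            "terms": {"field": "source.ip", "size": 10},
            "aggs": {
                "total_bytes": {"sum": {"field": "total.bytes"}}
            },
        }
    },
}

print(json.dumps(query_body, indent=2))
```

The filters sit in the `filter` clause of the `bool` query because they don't need scoring, so only the matching ~1/100th of the documents should reach the aggregation.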
TL;DR: What kind of cluster would I be looking at if I need to search and visualize ~3 billion documents in 1 ~ 2 minutes?