While @Christian_Dahlqvist is correct that benchmarks are important, you will want a sensible starting point. Let's go through your numbers...
3,000,000,000 transaction records per month ≈ 1,157 records/second.
This really isn't a very high event rate. It could easily be handled by a 3-node cluster, so ingest isn't your limiting factor. Even if the daily peaks are 10 times the average rate, you should be fine.
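The rate arithmetic above can be sketched as a quick back-of-envelope check (assuming a 30-day month and the hypothetical 10x peak factor mentioned above):

```python
# Back-of-envelope ingest rate check.
# Assumptions: 30-day month, 10x daily peak factor (illustrative).
RECORDS_PER_MONTH = 3_000_000_000
DAYS_PER_MONTH = 30
SECONDS_PER_DAY = 86_400

avg_rate = RECORDS_PER_MONTH / (DAYS_PER_MONTH * SECONDS_PER_DAY)
peak_rate = avg_rate * 10  # assumed 10x peak

print(f"average: {avg_rate:.0f} records/sec")  # ~1157
print(f"peak:    {peak_rate:.0f} records/sec")  # ~11574
```

Even the assumed peak is well within what a small cluster can ingest.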
The data is already JSON, so we can ignore any amplification from converting it to JSON. At 100M records per day, with each record being 6KB, you have 600GB per day. However, that may not be the actual size of the data on disk. Elasticsearch isn't simply storing a JSON file; it is indexing the data to make it searchable. Data is also compressed (using the `best_compression` codec will save additional disk space). Worst case, you will need to store 600GB/day multiplied by the number of copies, i.e. the primary plus any replicas (additional copies of the data for redundancy and HA). With a single replica you will have to store 1.2TB of data per day (600GB x 2).
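For reference, `index.codec: best_compression` and the replica count are both index settings applied at index-creation time. A minimal sketch of the settings body (the setting names are the real Elasticsearch settings; the values shown are just the ones discussed above):

```python
import json

# Index settings enabling higher-ratio compression for stored fields.
# "best_compression" trades some indexing/merge CPU for disk savings.
settings = {
    "settings": {
        "index": {
            "codec": "best_compression",
            "number_of_replicas": 1,  # one extra copy, doubling on-disk size
        }
    }
}

print(json.dumps(settings, indent=2))
```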
So now you can think about how many nodes you need to store 1.2TB/day for 1 year.
1.2TB x 365 = 438TB
A general rule of thumb is that an Elasticsearch node storing time-series data can hold about 8TB of data.
438TB / 8TB per node = 54.75, so ~55 nodes.
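Putting the storage math together in one place (a sketch; the 6KB record size, single replica, 1-year retention, and 8TB-per-node rule of thumb are the assumptions stated above):

```python
import math

# Assumptions from the discussion above: 100M records/day at ~6KB each,
# 1 replica, 1-year retention, ~8TB usable per time-series node.
RECORDS_PER_DAY = 100_000_000
RECORD_SIZE_KB = 6
REPLICAS = 1
RETENTION_DAYS = 365
TB_PER_NODE = 8

daily_gb_primary = RECORDS_PER_DAY * RECORD_SIZE_KB // 1_000_000  # 600 GB/day
daily_gb_total = daily_gb_primary * (1 + REPLICAS)                # 1200 GB/day
total_tb = daily_gb_total * RETENTION_DAYS / 1000                 # 438.0 TB
nodes = math.ceil(total_tb / TB_PER_NODE)                         # 55

print(f"{total_tb:.0f} TB over retention -> {nodes} nodes")
```

Changing any one input (record size, replicas, retention) scales the node count directly, which is why those are the levers worth attacking first.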
So clearly the data retention duration and record size are the top contributing factors to the number of nodes required.
Based on the information you provided, I recommend focusing on understanding the storage requirements for your transaction records, and experimenting with how to reduce that requirement by optimizing which data is retained and how it is indexed. You can then size the cluster based on your retention requirements.
Some of the questions you will want to ask...
- Do you really need all of the information in each transaction to deliver your use-cases? It is assumed each transaction is stored in an ACID-compliant datastore and that Elasticsearch will be used for search in a few specific use-cases. If so, you may be able to shrink the data volume considerably by discarding fields that don't need to be searchable.
- Over which time period is the data most often queried? (hot/warm, or even hot/warm/cold, architecture may make a lot of sense.)
- What is the rate of searches against the data set? (helps to determine the ratio of hot to warm nodes as well as the number of replicas)
- What is the total number of documents and the size of the indices? (you may be able to store more than 8TB on a node)
- The above assumes each Elasticsearch node has 64GB of RAM and SSD storage.
- If warm nodes are an option, they would usually have HDD storage. However, there is also an argument to be made for using "prosumer"-class SSDs in warm nodes, especially when the use-case requires queries over the entire dataset.
- For search-heavy use-cases I tend to favor more RAM (96-128GB) to give the OS more capacity for page caching. By holding a larger portion of the dataset in RAM, query performance can improve significantly.
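To make the hot/warm question concrete, here is a sketch of how the split changes the node count. The 30-day hot window and the 8TB-hot / 20TB-warm per-node figures are illustrative assumptions only, not recommendations; your own query patterns should drive them:

```python
import math

# Illustrative assumptions: 1.2TB/day total (from the math above),
# 30-day hot window, 365-day retention, 8TB per hot node (SSD),
# 20TB per warm node (HDD-backed nodes can often hold more).
DAILY_TB = 1.2
HOT_DAYS = 30
RETENTION_DAYS = 365
HOT_TB_PER_NODE = 8
WARM_TB_PER_NODE = 20

hot_nodes = math.ceil(DAILY_TB * HOT_DAYS / HOT_TB_PER_NODE)
warm_nodes = math.ceil(DAILY_TB * (RETENTION_DAYS - HOT_DAYS) / WARM_TB_PER_NODE)

print(f"hot: {hot_nodes}, warm: {warm_nodes}, total: {hot_nodes + warm_nodes}")
```

Under these assumptions the tiered layout needs roughly half the nodes of a flat 55-node cluster, which is why the hot/warm questions above matter so much.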
Hopefully that helps a little.
Robert Cowart (firstname.lastname@example.org)
True Turnkey SOLUTIONS for the Elastic Stack