Hardware and approach recommendations for implementing Elasticsearch

subhraakanil · February 22, 2018, 7:09am

Hi All,

We are planning to use Elasticsearch in production for searching financial transactions, considering its exceptional throughput. We will call the APIs exposed by Elasticsearch from our Java application, not from Kibana. This includes the following:

1. Storing financial transactions in Elasticsearch, either one by one or by bulk upload.
2. Searching transactions (historical data) in Elasticsearch.

We would like to hear your suggestions regarding the hardware for implementing it. Here are my requirements:

1. Projected number of financial transactions: 3,000,000,000 per month.
2. Each transaction's JSON size is 6,000 bytes.
3. Data is to be stored for 1 year.
4. Sorting is to be done while searching.

Please suggest an Elasticsearch cluster setup for better performance. Also, what is the best approach for consuming Elasticsearch APIs from a Java application?

Christian_Dahlqvist · February 22, 2018, 7:34am

The size and specification of the required cluster will depend a lot on the nature of your data, how you map it, how you need to query it and what your latency requirements are. Restrictions on the hardware available may also have an impact. I would therefore recommend running some benchmarks in order to get an answer. This Elastic{ON} talk discusses how to perform such benchmarks.

rcowart · February 22, 2018, 11:01am

While @Christian_Dahlqvist is correct that benchmarks are important, you will want to have a sensible starting point. Let's go through your numbers...

3,000,000,000 transaction records per month = about 1,157 records/second (averaged over a 30-day month).

This really isn't a very high event rate. It could easily be handled with a 3-node cluster, so it isn't your limiting factor. Even if the daily peaks are 10 times the average rate, you should be fine.

The data is already JSON, so we will ignore any amplification from converting it to JSON. At 100M records per day (3 billion / 30 days), with each record being 6KB, you have 600GB per day. However, that might not be the actual size of the data on disk. Elasticsearch isn't simply storing a JSON file; it is "indexing" the data to make it more searchable. Data is also compressed (setting the index codec to "best_compression" will save additional disk space). In the worst case you will probably need to store 600GB/day multiplied by the number of copies of the data (the primary plus each replica, which is an additional copy for redundancy and HA). With only a single replica you will have to store 1.2TB of data per day (600GB x 2).
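As a rough illustration of the two knobs mentioned here, the sketch below builds an index-creation settings body with one replica and the `best_compression` codec. `index.number_of_replicas` and `index.codec` are real Elasticsearch index settings; the helper name `index_settings` is just an assumption for this example, and the body would be sent when creating the index (for example as the body of a `PUT` to the index name).

```python
import json


def index_settings(replicas=1, best_compression=True):
    """Build an index-creation body (hypothetical helper, not an Elasticsearch API)."""
    index = {"number_of_replicas": replicas}
    if best_compression:
        # index.codec is a static setting, so it is set when the index is created.
        index["codec"] = "best_compression"
    return {"settings": {"index": index}}


# Print the JSON body that would accompany the index-creation request.
print(json.dumps(index_settings(), indent=2))
```

Whether `best_compression` is worth the extra CPU cost at index time is something the benchmarks should show for your data.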

So now you can think about how many nodes you need to store 1.2TB/day for 1 year.

1.2TB x 365 = 438TB

A general rule for an Elasticsearch node storing time series data is that it should be able to store about 8TB of data.

438TB / 8TB per node = 54.75, so about 55 nodes.

So clearly the data retention duration and record size are the top contributing factors to the number of nodes required.
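The arithmetic above can be sketched as a small calculator, which makes it easy to see how record size, retention and replica count move the node count. The function name `size_cluster` and its parameters are assumptions for illustration; it takes the raw JSON size as the worst-case on-disk size, a 30-day month, and the 8TB-per-node rule of thumb from this post.

```python
import math

SECONDS_PER_DAY = 86_400


def size_cluster(records_per_month, record_bytes, retention_days,
                 replicas=1, node_capacity_tb=8, days_per_month=30):
    """Worst-case sizing: assumes on-disk size equals raw JSON size."""
    records_per_day = records_per_month / days_per_month
    records_per_second = records_per_day / SECONDS_PER_DAY
    primary_tb_per_day = records_per_day * record_bytes / 1e12
    # Each replica is a full extra copy of the primary data.
    total_tb_per_day = primary_tb_per_day * (1 + replicas)
    total_tb = total_tb_per_day * retention_days
    return {
        "records_per_second": records_per_second,
        "total_tb_per_day": total_tb_per_day,
        "total_tb": total_tb,
        "nodes": math.ceil(total_tb / node_capacity_tb),
    }


result = size_cluster(3_000_000_000, 6000, retention_days=365)
for name, value in result.items():
    print(f"{name}: {value:,.1f}")
```

With the figures from this thread it reproduces roughly 1,157 records/second, 1.2TB/day, 438TB and 55 nodes. Halving the record size or the retention period halves the node count, which is why those two inputs deserve the most attention.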

Based on the information that you provided, I recommend you focus on understanding the storage requirements for your transaction records, and focus experimentation on how to reduce that requirement by optimizing which data is retained and how it is indexed. You can then size the cluster based on your retention requirements.

Some of the questions you will want to ask...

Do you really need all of the information in each transaction to deliver your use-cases? It is assumed that each transaction is stored in an ACID-compliant datastore and that Elasticsearch will be used for search in a few specific use-cases. If so, you may be able to shrink the data volume considerably by discarding fields that don't need to be available to search.
Over which time period is the data most often queried? (A hot/warm, or even hot/warm/cold, architecture may make a lot of sense.)
What is the rate of searches against the data set? (This helps to determine the ratio of hot to warm nodes as well as the number of replicas.)
What is the total number of documents, and what is the size of the indices? (You may be able to store more than 8TB on a node.)
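To make the first question concrete, here is a minimal sketch of trimming a transaction down to the fields that actually need to be searchable before it is indexed. All field names and the `project_searchable` helper are hypothetical; the full record is assumed to live in the system of record.

```python
import json

# Hypothetical set of fields the search use-cases actually need.
SEARCHABLE_FIELDS = {"txn_id", "account_id", "amount", "currency", "timestamp"}


def project_searchable(txn, searchable_fields=SEARCHABLE_FIELDS):
    """Return a copy of the transaction with only the searchable fields."""
    return {k: v for k, v in txn.items() if k in searchable_fields}


# Made-up sample record; the large payload stands in for data that is never searched.
sample_txn = {
    "txn_id": "T-0001",
    "account_id": "A-42",
    "amount": 125.50,
    "currency": "EUR",
    "timestamp": "2018-02-22T07:09:00Z",
    "raw_message": "x" * 5000,
}

slim = project_searchable(sample_txn)
full_bytes = len(json.dumps(sample_txn))
slim_bytes = len(json.dumps(slim))
print(f"full: {full_bytes} bytes, slim: {slim_bytes} bytes")
```

The per-document saving feeds straight into the storage estimate, since record size scales the total linearly.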

About hardware...

The above assumes each Elasticsearch node has 64GB of RAM and SSD storage.
If warm nodes are an option, they would usually have HDD storage. However, there is also an argument to be made for using "prosumer" class SSDs in warm nodes, especially when the use-case requires queries over the entire dataset.
For search-heavy use-cases I tend to favor more RAM (96-128GB), to give the OS more capacity for page-caching. By holding a larger portion of the dataset in RAM, query performance can be improved significantly.
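A back-of-the-envelope sketch of why the extra RAM helps: it assumes the commonly cited guidance of giving the JVM heap at most half of RAM and keeping it below roughly 32GB, so nearly everything beyond that goes to the OS page cache. The 31GB cap and the helper name `page_cache_gb` are assumptions for illustration, and the result ignores memory used by the OS itself.

```python
def page_cache_gb(ram_gb, heap_cap_gb=31):
    """Estimate RAM left for the OS page cache after the JVM heap (rough sketch)."""
    # Heap is the smaller of half the RAM and the assumed cap.
    heap_gb = min(ram_gb / 2, heap_cap_gb)
    return ram_gb - heap_gb


for ram in (64, 96, 128):
    print(f"{ram}GB RAM -> about {page_cache_gb(ram):.0f}GB available for page cache")
```

Under these assumptions, going from 64GB to 128GB of RAM roughly triples the memory available for caching index files.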

Hopefully that helps a little.

Rob

Robert Cowart (rob@koiossian.com)
www.koiossian.com
True Turnkey SOLUTIONS for the Elastic Stack

subhraakanil · February 22, 2018, 5:59pm

Thanks a lot.

subhraakanil · February 22, 2018, 6:01pm

Thanks a lot. I got a better picture.
