We are looking for solutions for the following requirements:
10 million incoming structured records per second (1 GB of data per second)
the data has to be available for queries within 2 seconds of its arrival
10^15 records in total (managed by date)
60K parallel simple queries on any combination of the columns (no joins are required)
Is Elastic a good candidate for these requirements?
If so, how can we estimate the cluster topology and the type and amount of hardware required to support these requirements?
Does it have any tricks for enabling quick indexing of new data? (e.g. - index new data in memory until it is indexed in persistent storage)
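For what it's worth on that last point, this is roughly how Elasticsearch already behaves: new documents sit in an in-memory indexing buffer (with durability handled by the translog) and become searchable on the next refresh, so the 2-second availability target mostly constrains `index.refresh_interval`. A minimal sketch of the relevant dynamic settings, assuming a hypothetical index name `records` and a local node on port 9200 (both placeholders, not from the thread):

```python
# Sketch only: adjust the settings that usually matter for high-rate indexing
# with a near-real-time visibility target. Index name "records" and the local
# endpoint are assumptions for illustration.
import json
import requests

ES = "http://localhost:9200"

settings = {
    "index": {
        # New documents become searchable on each refresh; keeping this at or
        # below the 2-second availability target trades some indexing
        # throughput for freshness.
        "refresh_interval": "1s",
        # Replicas can be added back after an initial bulk load to speed up ingest.
        "number_of_replicas": 0,
    }
}

resp = requests.put(f"{ES}/records/_settings", json=settings)
print(resp.status_code, json.dumps(resp.json(), indent=2))
```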
ES should be able to handle this, though you'd need to scale a fair bit with some hefty hardware.
The best way would be to provision a node of a given size, preferably one you would use in production, then load it up with data and run queries to build a benchmark.
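A minimal sketch of that kind of single-node ingest benchmark, using the plain `_bulk` REST API; the index name, batch size and synthetic ~100-byte document shape are all placeholders, not from the thread:

```python
# Rough single-node ingest benchmark: time a bulk request of synthetic
# ~100-byte documents and report the achieved docs/s. All names and sizes
# are placeholders.
import json
import time
import requests

ES = "http://localhost:9200"
INDEX = "bench-records"
BATCH_SIZE = 10_000

def bulk_body(n):
    # _bulk expects newline-delimited JSON: an action line followed by a source line.
    lines = []
    for i in range(n):
        lines.append(json.dumps({"index": {"_index": INDEX}}))
        lines.append(json.dumps({"id": i, "ts": time.time(), "value": "x" * 60}))
    return "\n".join(lines) + "\n"

body = bulk_body(BATCH_SIZE)
start = time.time()
resp = requests.post(f"{ES}/_bulk",
                     data=body,
                     headers={"Content-Type": "application/x-ndjson"})
elapsed = time.time() - start
print(f"{BATCH_SIZE} docs in {elapsed:.2f}s "
      f"({BATCH_SIZE / elapsed:,.0f} docs/s), errors={resp.json().get('errors')}")
```

Repeating this with production-like mappings and documents, while running representative queries in parallel, gives a per-node throughput figure you can extrapolate from.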
I suspect this will be hard to do. If I have calculated correctly, you expect to have small records (~107 bytes), and the total raw data volume indexed per day is around 84TB. It also seems you are planning a retention period of around 3 years, based on the total number of records in the system, which means the total raw volume (not indexed or replicated) that needs to be kept in the cluster is around 95PB. Given that you also have quite a high query rate, with flexible queries against an unspecified time period, I think you will need to re-engineer the solution in order to do this in a cost-effective way with Elasticsearch.
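For anyone checking the arithmetic behind those figures, a quick back-of-the-envelope script reproducing them, assuming binary units (1GB = 2^30 bytes):

```python
# Back-of-the-envelope check of the figures above (binary units assumed).
GIB = 2**30

records_per_sec = 10_000_000
bytes_per_sec = 1 * GIB
record_size = bytes_per_sec / records_per_sec                  # ~107 bytes
per_day_tib = bytes_per_sec * 86_400 / 2**40                   # ~84 TiB/day raw
total_records = 10**15
total_raw_pib = total_records * record_size / 2**50            # ~95 PiB raw
retention_years = total_records / (records_per_sec * 86_400) / 365  # ~3.2 years

print(f"record size      ~ {record_size:.0f} bytes")
print(f"raw volume/day   ~ {per_day_tib:.0f} TiB")
print(f"total raw volume ~ {total_raw_pib:.0f} PiB")
print(f"retention        ~ {retention_years:.1f} years")
```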
Thank you both for the response.
Unfortunately I had a typo in one of the requirements - we are looking at 10^12 = 1,000,000,000,000 records in total. On average each record should be around 105 bytes, which leads us to a raw volume of roughly 95TB.
Even though at peak times we have to support indexing 10M records within 2 seconds, the average scenario is 10 billion new records per day.
We can also assume that most query executions will be repeated executions of the same queries against newly indexed data (we provide the previous execution timestamp as a parameter for the query).
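A sketch of what that incremental query pattern might look like in the query DSL, assuming a hypothetical `ingest_time` field and an arbitrary column filter (both are illustrative assumptions, not from the thread):

```python
# Sketch of the repeated incremental query: only documents indexed since the
# previous execution are considered. The "ingest_time" field, the example
# column filter and the placeholder timestamp are assumptions.
import requests

ES = "http://localhost:9200"
INDEX = "records"
previous_run = "1970-01-01T00:00:00Z"  # placeholder for the last execution timestamp

query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"ingest_time": {"gt": previous_run}}},
                {"term": {"some_column": "some_value"}},
            ]
        }
    },
    "size": 100,
}

resp = requests.post(f"{ES}/{INDEX}/_search", json=query)
print(resp.json()["hits"]["total"])
```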
Given the corrected requirements, can we estimate the size of the cluster(s)? 10/100/1000 nodes?
Even though that sounds much more reasonable, it is still quite a high peak indexing rate combined with a high query rate. The best way to estimate the required cluster size would be to follow Mark's advice and benchmark on a small cluster.