We are a new start-up going through development with no funding other than out of our own pockets. I am hoping to get some advice from others who are far more experienced than I am, given that we have only used Elasticsearch for the last month.
Currently, for development, we have one dedicated server running a single ES node. That server has 128 GB of RAM (32 GB allocated to ES) and a 2 TB NVMe drive.
We are indexing real-time financial data, and after just one week our index holds a little over 2 billion records. We understand that in production we will need a multi-node cluster; our plan is to run 2-3 dedicated servers, all of the same size and specs, from the same service provider (ideally in the same rack, but certainly on the same private network).
My issue is this: after one week of indexing we have used 50% (1 TB) of the NVMe drive. It's clear that we will be using at least 4 TB each month, so the dedicated servers will need bigger drives to handle the multi-node cluster.
My question is:
We will be running aggregations on data up to roughly one month old, with daily and weekly data being the most important. The older the data becomes, the fewer queries/aggregations will be run against it, and it will become more of a search-only workload (e.g. a user filters data for a given day).
Indexing 4 TB of data each month will become very expensive for us as a start-up if we use NVMe/SSD storage, because I assume each of the 3 dedicated servers in the cluster will need a considerable amount of NVMe storage.
For example, we have been given a quote for 3 servers, each with the same specs we have now but with 40 TB of NVMe drives in RAID, and the cost is approx. $3k per server per month!
Does anyone think we could get away with using HDD, or a mix of NVMe and HDD in the cluster, so that the HDD nodes are specifically for documents older than 1 month? I assume we would need to somehow configure the cluster so that Elasticsearch automatically moves data to the HDD nodes when a document is older than 1 month?
Stats after 1 week:
Index size: 2 billion documents
Disk usage: 50% of 2 TB NVMe
nodes: 1
hosts: 1
data ingestion: near real time
query/aggregations: near real time on daily/weekly documents
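As a rough sanity check on the numbers above, the storage growth can be sketched in a few lines of Python. The 1 TB/week rate is the measured figure from the post; the 30-day month, linear growth, and the one-replica case are assumptions added for illustration.

```python
# Back-of-the-envelope storage estimate from the stats above.
# Assumptions (not from the post): a 30-day month, linear growth,
# and each replica being a full copy of the primary data.

def monthly_storage_tb(tb_per_week: float, replicas: int = 0, days: int = 30) -> float:
    """Estimated disk used per month across the whole cluster, in TB."""
    primary = tb_per_week / 7 * days
    return primary * (1 + replicas)

primary_only = monthly_storage_tb(1.0)          # primaries only
with_one_replica = monthly_storage_tb(1.0, 1)   # primaries plus one replica copy
print(f"{primary_only:.1f} TB/month primaries, "
      f"{with_one_replica:.1f} TB/month with 1 replica")
```

With one replica the monthly footprint roughly doubles, which matters when sizing the slower storage tier as well as the NVMe tier.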
For performance, I would recommend keeping all indices that are actively indexed into, and the ones most heavily queried, on nodes with fast NVMe disks, as this workload is generally very disk I/O intensive. Assuming your data is immutable and not updated, you can use time-based indices and move older indices that are no longer being written to onto a different set of nodes with large amounts of slower storage. This is often referred to as a hot-warm architecture, and it can be extended to a hot-warm-cold architecture if needed.
This is often the most cost-effective way to handle large amounts of immutable data where the most recent data is queried most frequently.
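A minimal sketch of how the move to slower nodes could be driven, assuming daily indices named like `ticks-2019.05.01` and HDD nodes tagged in `elasticsearch.yml` with a custom attribute such as `node.attr.box_type: warm`. The index prefix, the attribute name, and the 30-day cutoff are all assumptions for illustration. The code only selects the old indices and builds the settings body that would be sent with `PUT /<index>/_settings`; index lifecycle management, where available, can automate the same step.

```python
from datetime import date, datetime, timedelta

INDEX_PREFIX = "ticks-"   # hypothetical daily index naming: ticks-YYYY.MM.DD
WARM_AFTER_DAYS = 30      # assumed cutoff, taken from "older than 1 month"

# Settings body for PUT /<index>/_settings. Shard allocation filtering
# (index.routing.allocation.require.<attr>) makes Elasticsearch relocate the
# index's shards to nodes whose custom attribute has the given value.
WARM_SETTINGS = {"index.routing.allocation.require.box_type": "warm"}

def indices_to_move(index_names, today):
    """Return daily indices dated more than WARM_AFTER_DAYS before today."""
    cutoff = today - timedelta(days=WARM_AFTER_DAYS)
    old = []
    for name in index_names:
        if not name.startswith(INDEX_PREFIX):
            continue
        try:
            day = datetime.strptime(name[len(INDEX_PREFIX):], "%Y.%m.%d").date()
        except ValueError:
            continue  # skip indices that do not follow the naming scheme
        if day < cutoff:
            old.append(name)
    return sorted(old)

names = ["ticks-2019.04.01", "ticks-2019.05.20", "other-index"]
for name in indices_to_move(names, today=date(2019, 6, 1)):
    print(f"PUT /{name}/_settings", WARM_SETTINGS)
```

Because whole indices are relocated rather than individual documents, this approach depends on time-based indices as described above.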
Thank you very much, Christian. I will look into these now, and I very much appreciate your time. The methods you mention sound applicable, as the data is time-based and thus won't be updated once indexed.
I just tried pricing up a high-availability cluster of this size on Elastic Cloud with 2x240GB hot nodes, 2x5TB warm nodes and it works out at about $1800/mo for the whole cluster, and that's a managed service that includes support and backups and turnkey upgrades and a whole bunch of extra features built in. That seems pretty cost-effective vs the time spent doing it yourself.
@DavidTurner Thanks for the heads-up on the Elastic managed service. In all honesty, I believe we will migrate to the managed service once we reach production and can cover the additional overhead. We have a fair number of paying customers on the sidelines waiting for the service to go live, so I believe we will be looking at the managed service in the near future for sure.
I've spent the past 24 hours looking into what you suggested, and for development and going into our first production run I believe that would be a very viable option. Once things are up and running, we will most likely migrate over to the Elastic managed service.
During the investigation of our index, I noticed number_of_replicas shows 1:
As far as I'm aware I have not set up any replicas, so I believe the default must be 1. Does this mean that we potentially have a replica of the data on the same server, given we only have 1 server running Elasticsearch with a single node/shard? If that is the case, then removing the replica could reduce our storage use, because our plan is to have a dedicated replica server in production that is separate from our primary nodes.
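For reference, the replica count is a dynamic index setting, so it can be changed on a live index. Note that Elasticsearch does not allocate a replica shard on the same node as its primary, so on a single-node cluster a configured replica stays unassigned (the cluster reports yellow health) and should not be taking up extra disk space. A minimal sketch of the settings request, assuming a node at `localhost:9200` and a hypothetical index name `ticks`; the request is only built here, not sent:

```python
import json
import urllib.request

def replica_settings_request(index, replicas, host="http://localhost:9200"):
    """Build (but do not send) a PUT /<index>/_settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# "ticks" and the host are placeholders, not names from the thread.
req = replica_settings_request("ticks", 0)
print(req.get_method(), req.full_url, req.data.decode())
# To actually apply it against a running node: urllib.request.urlopen(req)
```

Setting the value back to 1 once a second data node joins would let the replica be allocated there.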