Help with managing storage for a large index

Hi,

We are a new start-up going through development with no funding other than out of our own pockets. I am hoping to get some advice from others who are far more experienced than myself, given that we have only used Elasticsearch for the last month.

Currently, for development, we have one dedicated server running a single ES node. That server has 128 GB of RAM (32 GB allocated to ES) and a 2 TB NVMe drive.

We are indexing real-time financial data; our index count after just one week is a little over 2 billion records. In production we understand we will need a multi-node cluster. Our thought is to run 2-3 dedicated servers, all of the same size and specs, from the same service provider (ideally in the same rack, but certainly on the same private network).

My issue is: right now, after one week of indexing, we have used 50% (1 TB) of the NVMe drive. It's clear that we will be using at least 4 TB each month, so the dedicated servers will require bigger drives to handle the multi-node cluster.

My question is;

We will be running aggregations on data roughly up to one month old, with daily/weekly data being the most important. The older the data becomes, the fewer queries/aggregations will be run against it, and it will become more of a search-only dataset (e.g. a user filters data for a given day).

Indexing 4 TB of data each month will become very expensive for us as a start-up if we use NVMe/SSD storage, because I assume each of the three dedicated servers in the cluster will need a considerable amount of NVMe drives.

For example, we have been given a quote for three servers, each with the same specs we have now but with 40 TB of NVMe drives in RAID, and that costs approx $3k each per month!

Does anyone think we could potentially get away with using HDDs, or a mix of NVMe and HDD in the cluster, so that the HDD nodes specifically hold documents older than one month? I assume we would need to somehow configure the cluster so that Elasticsearch automatically moves data to the HDD nodes once a document is older than one month?

Stats after 1 week:
Index size: 2 billion documents
Disk usage: 50% of 2 TB NVMe
Nodes: 1
Hosts: 1
Data ingestion: near real time
Queries/aggregations: near real time on daily/weekly documents

Many thanks, I appreciate any help I can get :slight_smile:

The first thing I would recommend you do, if you have not already done so, is optimize your mappings to minimize the index size.
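To make that concrete, here is a minimal sketch of a tightened mapping; the index name and fields are hypothetical placeholders, so adapt them to your own documents:

```
PUT financial-data-000001
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "timestamp": { "type": "date" },
      "symbol":    { "type": "keyword" },
      "price":     { "type": "scaled_float", "scaling_factor": 100 },
      "volume":    { "type": "long" }
    }
  }
}
```

Setting `dynamic` to `strict` stops unexpected fields from silently inflating the mapping, using `keyword` instead of `text` for identifiers avoids analysis overhead, and `scaled_float` stores prices as scaled integers, which generally compresses better than `double`.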

For performance I would recommend keeping all indices that are actively indexed into, as well as the most heavily queried ones, on nodes with fast NVMe disks, as this work is generally very disk-I/O intensive. Assuming your data is immutable and never updated, you can use time-based indices and relocate older indices that are no longer written to onto a different set of nodes with large amounts of slower storage. This is what is often referred to as a hot-warm architecture, and it can be extended to a hot-warm-cold architecture if needed.
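A minimal sketch of how that tiering can be wired up, assuming a custom node attribute named `box_type` (the attribute name, policy name, and thresholds here are illustrative, not prescriptive): tag each node in its elasticsearch.yml, e.g. `node.attr.box_type: hot` on the NVMe nodes and `node.attr.box_type: warm` on the HDD nodes, then let an ILM policy move indices once they pass the one-month mark:

```
PUT _ilm/policy/hot-warm-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "allocate": { "require": { "box_type": "warm" } },
          "forcemerge": { "max_num_segments": 1 }
        }
      }
    }
  }
}
```

Attach the policy to your index template and Elasticsearch will roll over, relocate, and force-merge each index automatically, with no manual shard shuffling.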

This is often the most cost-effective way to handle large amounts of immutable data where the most recent data is queried most frequently.


Thank you very much, Christian, I will look into these now and I very much appreciate your time. The methods you mention sound applicable, as the data is time-based and thus won't be updated once indexed.

I just tried pricing up a high-availability cluster of this size on Elastic Cloud with 2x240GB hot nodes, 2x5TB warm nodes and it works out at about $1800/mo for the whole cluster, and that's a managed service that includes support and backups and turnkey upgrades and a whole bunch of extra features built in. That seems pretty cost-effective vs the time spent doing it yourself.


@DavidTurner Thanks for the heads-up on the Elastic managed service. In all honesty, I believe we will migrate to the managed service once we reach production and can cover the additional overhead. We have a fair number of paying customers on the sidelines waiting for the service to go live, so I believe we will be looking at the managed service in the near future for sure.

Hi Christian,

I've spent the past 24 hours looking into what you suggested, and for development and our first production run I believe that would be a very viable option; once things are up and running we will most likely migrate over to the Elastic managed service.

During the investigation of our index, I noticed number_of_replicas shows 1:

    "number_of_shards" : "1",
    "number_of_replicas" : "1",

Our elasticsearch.yml file on the server for development is:

cluster.name: lab
node.name: lab1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: 0.0.0.0
http.port: 9200
discovery.type: 'single-node'
indices.query.bool.max_clause_count: 8192
search.max_buckets: 250000
action.destructive_requires_name: 'true'
reindex.remote.whitelist: '*:*'
xpack.monitoring.enabled: 'true'
xpack.monitoring.collection.enabled: 'true'
xpack.monitoring.collection.interval: 30s
xpack.security.enabled: 'true'
xpack.security.audit.enabled: 'false'
node.ml: 'false'
xpack.ml.enabled: 'false'
xpack.watcher.enabled: 'false'
xpack.ilm.enabled: 'true'
xpack.sql.enabled: 'true'

As far as I'm aware I have not set up any replicas, so I believe the default must be 1. Does this mean we potentially have a replica of the data on the same server, given we only have one server running Elasticsearch with a single node/shard? If that is the case, then removing the replica could potentially reduce our storage usage, because our plan is to have a dedicated replica server in production that is separate from our primary nodes.
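If so, I assume we could drop the replica with an index settings update along these lines (a sketch; `financial-data` is a placeholder for our real index name):

```
PUT financial-data/_settings
{
  "index": { "number_of_replicas": 0 }
}
```

From what I've read, Elasticsearch never allocates a replica shard on the same node as its primary, so on a single-node cluster the replica just sits unassigned (cluster health yellow) rather than consuming disk, but setting it to 0 explicitly would at least keep the cluster green until we have dedicated nodes for replicas.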

Once again, many thanks and appreciate your time.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.