I'm researching scalability costs for an Elasticsearch search engine project. I understand the hardware requirements at a small scale, but at a large scale I'm having a hard time wrapping my head around it.
I have 100 TB worth of data that will be rotated bi-weekly. Hot/cold storage isn't a factor here. At most 3 people will ever use it at the same time. My gut instinct says that 3 master nodes with 30 GB shards is not enough, even if the VMs have high CPU/RAM and enough storage to hold a single copy.
I'm starting to think more nodes, with less disk space and fewer CPU cores each, might be the way to go. Something like 30 VMs with 30 GB shards and 4 TB of storage each. The c2-standard-16 instance type from Google Cloud looks interesting to me.
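Here's the back-of-the-envelope math that got me to roughly 30 VMs (Python; the 30 GB target shard size and 4 TB of usable disk per node are my own assumptions, not measured numbers):

```python
# Rough cluster-sizing arithmetic for the proposal above.
# Assumptions: 100 TB of primary data, no replicas, ~30 GB target shard size,
# ~4 TB of usable disk per data node.

TOTAL_DATA_TB = 100
TARGET_SHARD_GB = 30
DISK_PER_NODE_TB = 4

total_gb = TOTAL_DATA_TB * 1000           # decimal TB -> GB for simplicity
shard_count = total_gb / TARGET_SHARD_GB  # primary shards needed
node_count = TOTAL_DATA_TB / DISK_PER_NODE_TB

print(f"~{shard_count:.0f} shards total")           # ~3333 shards
print(f"~{node_count:.0f} data nodes for disk alone")  # ~25 nodes before headroom
print(f"~{shard_count / 30:.0f} shards per node on a 30-node cluster")  # ~111
```

So 25 nodes is the bare minimum for disk alone, before leaving any headroom for merges and the bi-weekly rotation, which is why I landed around 30.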
Primary queries will be full-text (HTML content), regex, and filtering on certain fields in JSON-formatted data. There are no cost constraints, though lower is of course better. Do I have too many VMs? Should there be more? What do you think?
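To give a sense of the query shapes, they'd look roughly like this (Python client against an 8.x cluster; the index and field names are made up for illustration):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Full-text search over extracted HTML content.
es.search(index="pages", query={"match": {"html_content": "error page template"}})

# Regex against a keyword field -- these can get expensive on large indices.
es.search(
    index="pages",
    query={"regexp": {"url.keyword": {"value": "https?://.*\\.example\\.com/.*"}}},
)

# Full text combined with structured filters on the JSON fields.
es.search(
    index="pages",
    query={
        "bool": {
            "must": [{"match": {"html_content": "login form"}}],
            "filter": [
                {"term": {"source": "crawler-a"}},
                {"range": {"ingested_at": {"gte": "now-14d"}}},
            ],
        }
    },
)
```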
Sounds about right to me, although it's unclear whether your 100 TiB dataset size accounts for replicas or not. If the dataset is unchanging while it's being searched (and the performance is acceptable to you), then it might be more efficient to use partially cached searchable snapshots instead. You might only need one node if you did that.
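For reference, mounting a partially cached searchable snapshot looks roughly like this with the 8.x Python client (the repository, snapshot, and index names are placeholders, and this assumes a snapshot repository is already registered):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Mount an index from an existing snapshot as a partially cached
# ("shared_cache") searchable snapshot, so only the parts of the index
# that queries actually touch get pulled onto local disk.
es.searchable_snapshots.mount(
    repository="my-gcs-repo",       # snapshot repository name (placeholder)
    snapshot="weekly-snapshot-1",   # snapshot containing the index (placeholder)
    index="pages",                  # index inside the snapshot (placeholder)
    renamed_index="pages-mounted",  # name of the mounted index
    storage="shared_cache",         # partially cached; "full_copy" keeps it all local
    wait_for_completion=True,
)
```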
When I ingest the data I check whether an existing copy of the document is already there. If so, I update a field in the document and continue.
Here is some background:
I have 100 TiB worth of data that contains uniquely identifying information. In my mapping schema I set index to false on that data, which keeps it from being displayed while still letting me search on it. When a duplicate is found, I update the date-time field, append to an array, and upload the partial document. There is no need for replicas, as (in general) the data will be cycled on a 2-week basis.
The chance that an existing document is updated while being searched is low, and even if it were, I'm not worried about all of the source fields being up to date. The most important data is the hidden unique data.
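One way to do the update-on-duplicate step is a scripted upsert keyed on the unique value, which avoids the separate existence check entirely. A rough sketch with the 8.x Python client (the index name, field names, and helper are illustrative, not my actual pipeline):

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def ingest(doc_id: str, source_name: str, doc: dict) -> None:
    """Create the document if it's new; if a copy already exists,
    bump the last_seen timestamp and append the source to an array."""
    now = datetime.now(timezone.utc).isoformat()
    es.update(
        index="records",  # illustrative index name
        id=doc_id,        # derived from the unique identifying data
        script={
            "source": (
                "ctx._source.last_seen = params.now;"
                " if (!ctx._source.sources.contains(params.src)) {"
                "   ctx._source.sources.add(params.src); }"
            ),
            "params": {"now": now, "src": source_name},
        },
        upsert={**doc, "last_seen": now, "sources": [source_name]},
    )
```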
The original plan was to have multiple indices, one for each source of data. If I combine everything into a single index, I can reduce duplicate data even further, at the risk of a long refresh time; perhaps that can be done overnight if I follow the RAM-to-storage ratio for the storage-dense category here: https://www.elastic.co/guide/en/cloud/current/ec-gcp-configuration-choose.html
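If refresh on the single big index does become a problem, my understanding is it can be deferred to the overnight window by disabling automatic refresh during ingest, roughly like this with the 8.x Python client (index name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

INDEX = "records"  # placeholder for the combined index

# Before the bulk ingest: turn off automatic refresh so new segments
# aren't constantly created and merged while documents stream in.
es.indices.put_settings(index=INDEX, settings={"index": {"refresh_interval": "-1"}})

# ... run the bulk ingest here ...

# Overnight: make everything searchable again and restore a normal interval.
es.indices.refresh(index=INDEX)
es.indices.put_settings(index=INDEX, settings={"index": {"refresh_interval": "1s"}})
```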