Hi there,
TL;DR:
I'm new to ES.
I want to start an Elasticsearch cluster for the following use case:
We manage billions of tiny documents. There is essentially no document churn (documents are rarely updated or deleted), but new ones can arrive millions at a time. We want to run exact full-value matches on a specific field and retrieve the matching documents' other fields within a few seconds.
How do I design my cluster in terms of:
- number and type of nodes
- number of shards
- number of indices
Long post:
I'm planning to implement an Elasticsearch cluster for an internal R&D project, and I'm struggling to work out a sensible design (number and type of nodes, number of shards, number of indices, etc.). I'm completely new to ES and have never configured a cluster.
Here is the use case:
- I manage very small documents: 5 to 10 words each (80 to 150 bytes on average, I'd say). The data consists of usernames and email addresses.
- I have a large number of documents: 1.5 billion are already stored in a MongoDB database, and that number is only a start. It might grow to a few billion.
- Searches on these documents are extremely simple; no aggregations will be done. 90% of searches consist of retrieving every document whose email address field exactly matches the search term (see the sketch after this list).
- However, I'd like these simple queries to run fast enough to give near-realtime results. They are issued by humans and are very occasional (there is never more than one query at a time).
- New data is occasionally added. Huge numbers of documents can arrive at once (from a few million to hundreds of millions). We can't predict when data will be added (it depends on external events), and it always requires manual processing first anyway. When data is added, all the added documents are related and share a common field.
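To make that concrete, here is a minimal sketch of what I have in mind, using the official Python client (elasticsearch 8.x). The index name, field names, and sample values are all made up for illustration:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# "keyword" fields are stored unanalyzed, so a `term` query matches
# the exact full value -- which is what my searches need.
# "emails" and the field names below are invented for this example.
es.indices.create(
    index="emails",
    mappings={
        "properties": {
            "email":    {"type": "keyword"},
            "username": {"type": "keyword"},
            "batch_id": {"type": "keyword"},  # the common field shared by each batch of added documents
        }
    },
)

# Huge additions would go through the bulk API, not one request per document.
docs = [
    {"email": "alice@example.com", "username": "alice", "batch_id": "batch-001"},
    {"email": "bob@example.com",   "username": "bob",   "batch_id": "batch-001"},
]
helpers.bulk(es, ({"_index": "emails", "_source": d} for d in docs))

# The 90% query: fetch every document whose email exactly matches the term.
resp = es.search(index="emails", query={"term": {"email": "alice@example.com"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```

My understanding is that mapping the fields as `keyword` keeps them untokenized, so a `term` query matches the whole stored value; please correct me if that's the wrong approach for exact lookups.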
I'm struggling to find a smart way to design the cluster:
- Indices: in terms of data type I basically have only one index, since all the documents represent the same kind of data.
- Sharding: as I have only one index, I assume I'd need a lot of shards, and since I also want searches over the dataset to be fast, that matters even more, right? Should I instead find a way to split my data into smaller indices to limit the number of shards per index? If I do, I'll end up with very unbalanced indices: the vast majority will be very small and a minority will be huge.
If a lot of shards are used, I should plan for hardware that parallelizes well (lots of cores), am I right? (I read that at query time, one CPU thread is used per shard to search the dataset.) A sketch of how I imagine setting the shard count follows.
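Since the shard count is fixed at index-creation time (afterwards it can only be changed with _split/_shrink or a reindex, as far as I understand), I imagine I'd have to settle it up front with something like this. The numbers are placeholders, not a plan:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Recreating the hypothetical "emails" index from the sketch above,
# this time with explicit shard settings. Placeholder numbers only.
es.indices.create(
    index="emails",
    settings={
        "number_of_shards": 20,   # searched in parallel, roughly one thread per shard
        "number_of_replicas": 1,  # one extra copy of each shard for redundancy
    },
)
```

Is `number_of_shards` even the right knob to reason about here, and how would I pick a value for billions of tiny documents?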
I already plan to use RAID 5 SSDs (hardware RAID with 4 disks) for each node, with 64 GB of RAM (at most 32 GB for the ES heap, since larger heaps perform worse once the JVM can no longer use compressed object pointers).
How many nodes should I use with this base hardware and use case?
Is Elasticsearch even the right tool for this use case compared with other solutions (Hadoop, Cassandra, etc.)? If so, what do you recommend in terms of hardware?
Thank you for your insights !