Cluster design for specific use case

Hi there,

TLDR;
I'm new to ES.
I want to set up an Elasticsearch cluster for the following use case:
We manage billions of tiny documents. There is basically no document movement (documents are rarely added or updated), but when they are added it can be millions at a time. We want to run simple exact-match searches on one specific field and get back the matching documents, with their other fields, within a few seconds.
How do I design my cluster in terms of:

  • number and type of nodes
  • number of shards
  • number of indices

Long post;

I'm planning to set up an Elasticsearch cluster for an internal R&D project, and I'm struggling to work out a sensible design (number/type of nodes, number of shards, which indices to use, etc.). I'm completely new to ES and have never configured a cluster.

Here is the use case:

  • I manage very small documents: 5 to 10 words each (80 to 150 bytes on average, I'd say).
    The data consists of usernames and email addresses.
  • I have a large number of documents: 1.5 billion are already stored in a MongoDB database, and that is only a start.
    The count may grow to a few billion.
  • Searches on these documents are extremely simple; no aggregations will be done.
    90% of the searches consist of retrieving every document whose value in one field (the email address) exactly matches the search term; see the sketch after this list.
  • However, I'd like these simple queries to run fast enough to give near-realtime results. They are performed by humans and are very occasional (there is never more than one query at a time).
  • New data is added occasionally. Huge numbers of documents can arrive at once (from a few million to hundreds of millions). We can't know in advance when data will arrive (it depends on external events), but it always requires manual processing anyway. When data is added, all the added documents are related and share a common field.
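To make the search requirement concrete, here is a rough sketch using the official Python client (elasticsearch-py, 8.x-style API). The index name "emails", the field names and the address are all made up for illustration; the point is that an exact-match lookup wants the email stored as a keyword field and queried with a term query:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Map the fields as keywords: each value is indexed as a single exact term,
# which is what a full-term (exact) lookup needs. Names are illustrative.
es.indices.create(
    index="emails",
    mappings={
        "properties": {
            "email": {"type": "keyword"},
            "username": {"type": "keyword"},
            "batch_id": {"type": "keyword"},  # the common field shared by each ingest batch
        }
    },
)

# Exact match on the email field: a term query against a keyword field.
resp = es.search(index="emails", query={"term": {"email": "alice@example.com"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```

Because a keyword field is not analysed, the term query is a direct lookup of one term in the inverted index rather than a tokenised full-text search, which is about as cheap as an Elasticsearch query gets.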

I'm struggling to find a smart way to design an ES cluster:

  • Indices: in terms of data type I basically need only one index, since all the documents represent the same kind of data.
  • Sharding: with only one index I would need a lot of shards, and since I also want searches over the dataset to be fast, that seems even more important, right? Should I find a way to split my data into smaller indices to limit the number of shards per index? If I do, I will end up with very unbalanced indices: the vast majority will be very small and a few will be huge.
    If a lot of shards are used, I assume I need hardware that parallelises well (lots of cores), right? (I read that during a query, one CPU thread is used per shard to search the data.)

I already plan to use RAID 5 SSDs (hardware RAID with 4 disks) on each node, with 64 GB of RAM (at most ~32 GB for the ES heap, since larger heaps lose the JVM's compressed object pointers and can actually hurt performance).
How many nodes should I use with this base hardware and use case?

Is Elasticsearch even relevant for this kind of use case compared to other solutions (Hadoop, Cassandra, etc.)? If so, what do you recommend in terms of hardware?

Thank you for your insights !

From your description it seems that you will have a few hundred GB of data, perhaps scaling up to a TB or so over time, and occasionally you will add a few tens of GB. These are not especially large numbers, so a single node might be sufficient; however, the smallest possible resilient cluster has three nodes. The recommendation is to aim for shards that are a few tens of GB in size, so a single index with 5-6 shards would seem like a good starting point. If it were me, I would start with a three-node cluster and run some experiments to validate whether that gives the performance I need.
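A minimal sketch of that starting point (index name is illustrative, 8.x-style Python client assumed): one index with six primary shards and one replica, so each shard stays in the tens-of-GB range and every shard has a copy on another node:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="emails",
    settings={
        "number_of_shards": 6,    # fixed at index creation time, so plan for growth
        "number_of_replicas": 1,  # one extra copy of each shard on a different node
    },
)
```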

Thanks for your answer; it is much appreciated.
With your example design, I was worried that the shards would contain too much data to return results quickly when querying the cluster.

In my current case, I don't mind if the service is occasionally unavailable for technical reasons (downtime is not a big concern, since data is always added manually), but data loss matters. Could I get away with running everything on a single-node cluster with the disks in a RAID configuration? Will ES still perform well?

I think you should validate this experimentally. Elasticsearch (really, Lucene) is pretty heavily optimised for large datasets.

In terms of performance I think you must perform your own benchmarking, but I would not be surprised if a single node performed well. However, a single-node setup does put you at greater risk of data loss than one with completely independent nodes. For instance, what if the RAID controller fails and writes garbage to every disk?
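If you do stay on a single node, a snapshot repository is the usual guard against losing the only copy of the data. A rough sketch of the plain REST calls follows; the repository name and path are made up, and the location has to be whitelisted via path.repo in elasticsearch.yml:

```python
import requests

ES = "http://localhost:9200"

# Register a shared-filesystem snapshot repository.
requests.put(
    f"{ES}/_snapshot/nightly_backups",
    json={"type": "fs", "settings": {"location": "/mnt/es_backups"}},
).raise_for_status()

# Take a snapshot of the index and wait for it to complete.
requests.put(
    f"{ES}/_snapshot/nightly_backups/snapshot_1",
    params={"wait_for_completion": "true"},
    json={"indices": "emails"},
).raise_for_status()
```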

You're right, using replica shards on several nodes would be better for that I guess.
