Creating a High-Throughput Elasticsearch Cluster

We are in the process of implementing Elasticsearch as the search solution for our organization. For the POC we ran a 3-node cluster (each node with 16 vCores, 60 GB RAM and 6 × 375 GB SSDs), with every node acting as master, data and coordinating node. Since it was a POC, indexing speed was not a consideration; we just wanted to see whether it would work at all.

Note: we did index 20 million documents on the POC cluster, and it took about 23-24 hours, which is pushing us to take the time to design the production cluster with proper sizing and settings.

Now we are designing a production cluster (on Google Cloud Platform) with emphasis on both indexing speed and search speed.

Our use case is as follows:

  1. We will bulk index 7 to 20 million documents per index (one index per client, all on a single cluster). This bulk index is a weekly process: we index all the data once and query it for the whole week before refreshing it. We are aiming for an indexing throughput of 0.5 million documents per second.

We are also looking for a strategy to scale horizontally as we add more clients; I describe it in the sections below.

  2. Our data model has a nested document structure and many queries on nested documents, which in my view are CPU-, memory- and IO-intensive. There are terms and range queries. To optimize the range queries I am using scaled_float, as we do not need much precision there. We are aiming for sub-second query times for the 95th percentile of queries.
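To make the query shape concrete, here is an illustrative body for the kind of nested terms + range query described above. Field names (`items`, `items.status`, `items.price`) are placeholders, not our real mapping:

```python
# Illustrative Elasticsearch query body: a terms filter and a range filter
# inside a nested context. The field names are hypothetical placeholders.
def build_nested_query(statuses, min_price, max_price):
    """Build a bool query with terms + range filters on nested fields."""
    return {
        "query": {
            "nested": {
                "path": "items",
                "query": {
                    "bool": {
                        "filter": [
                            {"terms": {"items.status": statuses}},
                            # scaled_float stores value * scaling_factor
                            # internally; queries still use decimal values.
                            {"range": {"items.price": {"gte": min_price,
                                                       "lte": max_price}}},
                        ]
                    }
                },
            }
        }
    }

body = build_nested_query(["active", "pending"], 10.0, 99.99)
```

Using `filter` clauses (rather than `must`) keeps these non-scoring, which also lets Elasticsearch cache them.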

I have done quite a bit of reading around this forum and other blogs where companies run high-performing Elasticsearch clusters successfully.

Following are my learnings:

  1. Have dedicated master nodes (always an odd number, to avoid split-brain). These machines can be medium-sized (16 vCores and 60 GB RAM).

  2. Give 50% of RAM to the ES heap, but never exceed ~31 GB, so the JVM keeps using compressed object pointers (compressed oops). We are planning to set it to 28 GB on each node.

  3. Data nodes are the workhorses of the cluster, so they need plenty of CPU, RAM and IO. We are planning on 64 vCores, 240 GB RAM and 6 × 375 GB SSDs per node.

  4. Have coordinating-only nodes as well, to take the bulk index and search requests.
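The heap rule of thumb from point 2 can be sketched as a tiny helper (illustrative only; the 31 GB cap is the conservative value below the compressed-oops cutoff):

```python
# Sketch of the heap sizing rule: half of the node's RAM, capped at 31 GB
# so the JVM keeps compressed object pointers (compressed oops).
def recommended_heap_gb(ram_gb, cap_gb=31):
    """Return a heap size in GB: min(RAM // 2, cap)."""
    return min(ram_gb // 2, cap_gb)

# 60 GB master/coordinator nodes stay under the cap; 240 GB data nodes hit it.
master_heap = recommended_heap_gb(60)    # 30 GB
data_node_heap = recommended_heap_gb(240)  # capped at 31 GB
```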

Now we are planning to begin with the following configuration:

3 Masters - 16 vCores, 60 GB RAM and 1 × 375 GB SSD
3 Coordinators - 64 vCores, 60 GB RAM and 1 × 375 GB SSD (compute-intensive machines)
6 Data Nodes - 64 vCores, 240 GB RAM and 6 × 375 GB SSDs

We plan to add one data node for each incoming client.

Now that hardware is out of the way, let's focus on indexing strategy.

A few best practices that I've collated:

  1. A low number of shards per node is good in most scenarios, but keep the data well distributed across all nodes so the load stays balanced. Since we are starting with 6 data nodes, I'm inclined to use 6 primary shards for the first client to utilize the cluster fully.
  2. Keep 1 replica to survive the loss of a node.
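As an illustration, the shard plan in the two points above could be expressed as an index settings body; with 6 primaries and 1 replica on 6 data nodes, that is 12 shard copies, i.e. 2 per node:

```python
# Settings body for a new per-client index, using the shard counts above.
def client_index_settings(primaries=6, replicas=1):
    return {
        "settings": {
            "number_of_shards": primaries,
            "number_of_replicas": replicas,
        }
    }

settings = client_index_settings()
# primaries * (1 + replicas) shard copies spread over the data nodes
total_shard_copies = 6 * (1 + 1)  # 12 copies / 6 nodes = 2 per node
```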

Next is the bulk indexing process. We have a full-fledged Spark installation and will use the elasticsearch-hadoop connector to push data from Spark to the cluster.
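For the Spark side, a hedged sketch of how the elasticsearch-hadoop write options might look for this setup; the hostnames are placeholders and the values are our assumptions, not tested settings:

```python
# Hypothetical elasticsearch-hadoop write options for this cluster.
ES_WRITE_OPTIONS = {
    "es.nodes": "coord-1,coord-2,coord-3",  # placeholder coordinating nodes
    "es.batch.size.bytes": "2mb",           # bulk request size per Spark task
}

# Usage from a Spark DataFrame (needs a live cluster, so shown commented out):
# df.write.format("org.elasticsearch.spark.sql") \
#     .options(**ES_WRITE_OPTIONS) \
#     .mode("append") \
#     .save("client_index")
```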

  1. During indexing we set refresh_interval to 1m so that refreshes are less frequent.

  2. We are using 100 parallel Spark tasks, each sending a 2 MB bulk request, so at any moment there is 100 × 2 MB = 200 MB of bulk requests in flight, which I believe is well within what ES can handle. We can certainly tune these numbers based on feedback or trial and error.

  3. I've read about cache percentages, thread pool sizes and queue size settings, but we plan to keep them at the defaults to begin with.

  4. We are open to either CMS (Concurrent Mark Sweep) or G1GC for garbage collection but would need advice on this. I've read the pros and cons of both and am unsure which one to use.
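To make the refresh_interval practice in point 1 concrete, here are sketch request bodies for the index settings API around a weekly load. Note: dropping replicas to 0 during the load and restoring them afterwards is an extra optimization I'm assuming here, not something decided above:

```python
# Settings bodies for the two phases of a weekly load. Only refresh_interval
# comes from the plan above; the replica flip is an assumed extra optimization.
def bulk_phase_settings(refresh="1m", replicas=0):
    """Settings while the weekly bulk load runs."""
    return {"index": {"refresh_interval": refresh,
                      "number_of_replicas": replicas}}

def serving_phase_settings(refresh="1s", replicas=1):
    """Settings restored once the load finishes."""
    return {"index": {"refresh_interval": refresh,
                      "number_of_replicas": replicas}}

# These dicts would be sent via PUT /<index>/_settings; the HTTP call is
# omitted so the sketch stays self-contained.
```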

Now to my actual questions:

  1. Is sending bulk indexing requests to the coordinating nodes a good design choice, or should we send them directly to the data nodes?

  2. We will send query requests via the coordinating nodes. My question is: let's say that since each data node has 64 cores, each node has a thread pool size of 64 and a queue size of 200. Assume that during a search the data nodes' thread pools and queues are completely exhausted. Will the coordinating nodes keep accepting and buffering search requests until their own queues also fill up, or will one thread on the coordinator also be blocked per query request?

My assumption is: a search request arriving at a coordinating node occupies one thread there, and the coordinator then sends the request to the data nodes, which in turn occupy threads wherever the queried data lives. Is this correct?

  3. While bulk indexing is running (assuming we do not index all clients in parallel but schedule them sequentially), how do we best design the cluster so that query times do not take much of a hit during the bulk index?

  4. Is this overall architecture a step in the right direction towards a scalable, high-performing cluster?

Please let me know in case you need any further information.




