Elasticsearch Configuration for 220 Million Records (400 GB Data)

Hi,

I am planning to develop a full-text search engine using Elasticsearch 8.x to handle approximately 220 million records, totaling around 400 GB of data. Each document consists of a mix of full-text fields and filter fields, with the following schema:

Full-text fields: title, abstract
Filter fields: journal_title, published_year, fields_of_study, is_open_access, citation_count (for filtering, faceting, and sorting).
No nested documents, parent-child relationships, or vector search are planned.
Use Case Details:
Queries:
Full-text searches on title and abstract.
Filtering on journal_title, published_year, fields_of_study, and is_open_access.
Sorting and aggregations for faceting (see the query sketch below).
Typical result sets are expected to range from hundreds to thousands of documents.
Indexing Rate:
Initial bulk indexing of 220 million records.
After the initial load, we anticipate 10,000 updates/additions per month.
Performance Requirements:
Handle up to 500 concurrent queries during peak times.
Desired query latency: under 1 sec.
Hardware:

No fixed hardware constraints; flexible on provisioning nodes based on recommendations.
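
To make the query pattern concrete, here is a rough sketch of what a representative query might look like using the official Python client; the index name, hosts, field names, and query text are placeholders based on the schema above:

```python
# Rough sketch of a representative query: full-text search on title/abstract,
# filters, aggregations for faceting, and sorting by citation_count.
# The index name "papers", the host, and the query text are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="papers",
    query={
        "bool": {
            "must": [
                {"multi_match": {
                    "query": "graph neural networks",
                    "fields": ["title^2", "abstract"],
                }}
            ],
            "filter": [
                {"term": {"is_open_access": True}},
                {"range": {"published_year": {"gte": 2015}}},
                {"terms": {"fields_of_study": ["Computer Science"]}},
            ],
        }
    },
    aggs={
        "by_field_of_study": {"terms": {"field": "fields_of_study", "size": 20}},
        "by_year": {"terms": {"field": "published_year", "size": 30}},
    },
    sort=[{"citation_count": "desc"}, "_score"],
    size=50,
)
print(response["hits"]["total"]["value"], len(response["hits"]["hits"]))
```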
Given this, I’d like guidance on the following Elasticsearch configurations:

Number of shards for optimal performance.
Number of nodes (servers) required for scalability.
Heap size per node.
Node capacities in terms of RAM and CPU cores.

I’d appreciate any insights or best practices based on similar use cases. Thank you in advance for your help!

Is that the raw size of the documents or the estimated size after you have indexed them?

It sounds like you will need to query some fields in different ways, so you may need to use multi-fields to accommodate the different mapping requirements. This can increase the size of your index, so it is important that you index a few GB of data and check how the size on disk compares to the raw size (if that is what was given).
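
To illustrate the multi-fields point, here is a minimal sketch of a mapping along those lines, assuming the Python client and an index named "papers" (the field types and `ignore_above` value are assumptions, not a recommendation):

```python
# Minimal sketch of a mapping using multi-fields: journal_title can be searched
# as full text and also filtered/faceted on exactly via a keyword sub-field.
# Index name, field types, and the ignore_above value are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="papers",
    mappings={
        "properties": {
            "title":    {"type": "text"},
            "abstract": {"type": "text"},
            "journal_title": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "fields_of_study": {"type": "keyword"},
            "published_year":  {"type": "short"},
            "is_open_access":  {"type": "boolean"},
            "citation_count":  {"type": "integer"},
        }
    },
)
```

Indexing a few GB of real data with a mapping like this and comparing the resulting size on disk (e.g. via the `_cat/indices` API) to the raw size should give you the expansion factor.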

10,000 updates/additions per month is very little, so we can consider this a pure search use case.

500 concurrent queries is quite high, so you will either need very fast disks (read: NVMe SSDs), enough RAM so that all data fits in the operating system page cache, or to scale the cluster out with additional replicas (or a combination of these). As latencies will depend on the type of queries as well as the document size, I would recommend you run some tests/benchmarks.

To answer these questions I believe you will need to run some benchmarks based on realistic queries and data. Generally you want shards somewhere between 30GB and 50GB in size, but this is use-case specific, so your numbers may vary.
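
As a rough illustration of that rule of thumb, and assuming the 400 GB figure is close to the indexed size (which the multi-fields point above may change), the arithmetic might look like this:

```python
# Back-of-the-envelope primary shard count from the 30-50 GB per-shard guideline.
# The 400 GB figure is an assumption; measure the real on-disk size after a test load.
import math

indexed_size_gb = 400
target_shard_size_gb = 40   # middle of the 30-50 GB range
primary_shards = math.ceil(indexed_size_gb / target_shard_size_gb)
print(primary_shards)       # -> 10

# That count would then be set at index creation time, e.g. combined with the
# mapping sketched earlier:
# es.indices.create(
#     index="papers",
#     settings={"number_of_shards": primary_shards, "number_of_replicas": 1},
#     mappings={...},
# )
```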

This talk is very, very old and a lot has changed since then. It does, however, discuss general benchmarking methodology, which may be useful.

This webinar discusses benchmarking and how to use the Rally benchmarking tool.

Are you planning on hosting this on premise or in the cloud?

I would recommend that you set up a small cluster on nodes with a good amount of RAM and fast storage. Then index data into shards of e.g. 30GB in size and run a benchmark with a small number of concurrent queries against this. Verify that these queries meet your latency SLAs. Then gradually increase the amount of data on the nodes and/or the number of concurrent queries.
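
For a quick sanity check before setting up a full Rally run, a crude concurrent-query test along these lines could work (the index name, query, and concurrency level are assumptions, and a real benchmark should use a realistic query mix):

```python
# Crude concurrent-query latency check: N workers issue searches in parallel and
# latency percentiles are reported. Not a substitute for a proper Rally benchmark.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
CONCURRENCY = 20            # start low, then increase gradually
QUERIES_PER_WORKER = 50

def run_worker(_):
    latencies = []
    for _ in range(QUERIES_PER_WORKER):
        start = time.perf_counter()
        es.search(
            index="papers",
            query={"multi_match": {"query": "machine learning",
                                   "fields": ["title", "abstract"]}},
            size=50,
        )
        latencies.append(time.perf_counter() - start)
    return latencies

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = pool.map(run_worker, range(CONCURRENCY))
    latencies = sorted(l for worker in results for l in worker)

print(f"p50={statistics.median(latencies) * 1000:.0f} ms  "
      f"p95={latencies[int(len(latencies) * 0.95)] * 1000:.0f} ms")
```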

If you can fit the full data set in memory you can get away without fast disks, and this should give optimal performance, but it will require a lot of RAM. This is generally the best way to support high query concurrency, as disk I/O is a common bottleneck.

Once you have reached the limit of the cluster in terms of how much data it can hold and how many concurrent queries it can support, you can either alter the hardware profile or scale out.

@Christian_Dahlqvist , thanks for the detailed explanations.

I am going to self-host this setup.

Given that your data set is large, but not huge, I would probably perform a test in the cloud with some machines with large amounts of RAM, e.g. r8g instances or similar, and deploy a single node per host. I would set up a small cluster of 3 nodes and see how far I could push it while keeping all data in the page cache. If you can avoid disk I/O you might end up limited by CPU and be able to serve a large number of concurrent queries with only a few nodes.
