Hi,
I am planning to develop a full-text search engine using Elasticsearch 8.x to handle approximately 220 million records, totaling around 400 GB of data. Each document consists of a mix of full-text fields and filter fields, with the following schema:
Full-text fields: title, abstract
Filter fields: journal_title, published_year, fields_of_study, is_open_access, and citation_count (used for filtering, faceting, and sorting).
No nested documents, parent-child relationships, or vector search are planned.
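For reference, here is roughly how I plan to map the fields. This is only a sketch: the index name `papers` and the exact field types are tentative choices on my part, not decided yet.

```python
# Tentative index mapping for the schema above.
# Field types are assumptions: text for searchable fields,
# keyword/numeric/boolean for filter and facet fields.
mapping = {
    "mappings": {
        "properties": {
            # Full-text fields: analyzed for search
            "title": {"type": "text"},
            "abstract": {"type": "text"},
            # Filter/facet fields: not analyzed
            "journal_title": {"type": "keyword"},
            "published_year": {"type": "short"},
            "fields_of_study": {"type": "keyword"},  # multi-valued
            "is_open_access": {"type": "boolean"},
            "citation_count": {"type": "integer"},
        }
    }
}
```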
Use Case Details:
Queries:
Full-text searches on title and abstract.
Filtering on journal_title, published_year, fields_of_study, and is_open_access.
Sorting and aggregations for faceting.
Typical result sets are expected to range from hundreds to thousands of documents.
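A representative query would combine a full-text match with filters, facet aggregations, and a sort, roughly like this (the search terms, filter values, and boosts are illustrative only):

```python
# Representative search body: full-text match on title/abstract,
# filters on the keyword/numeric fields, facets, and a sort.
query_body = {
    "query": {
        "bool": {
            "must": [
                {"multi_match": {
                    "query": "transformer language models",
                    "fields": ["title^2", "abstract"],  # boost title matches
                }}
            ],
            # Filters are cached and do not affect scoring
            "filter": [
                {"term": {"is_open_access": True}},
                {"range": {"published_year": {"gte": 2018}}},
                {"terms": {"fields_of_study": ["Computer Science"]}},
            ],
        }
    },
    # Facets for the UI
    "aggs": {
        "by_journal": {"terms": {"field": "journal_title", "size": 20}},
        "by_year": {"terms": {"field": "published_year"}},
    },
    "sort": [{"citation_count": "desc"}, "_score"],
    "size": 100,
}
```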
Indexing Rate:
Initial bulk indexing of 220 million records.
After the initial load, we anticipate roughly 10,000 updates/additions per month (i.e., a very low ongoing write rate).
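For the initial bulk load I plan to feed the bulk API via a generator, along these lines (a sketch: `docs` is a placeholder for my record iterator, and I am assuming each record carries a stable `id` field):

```python
def bulk_actions(docs, index="papers"):
    """Yield bulk-API actions for elasticsearch.helpers.bulk().

    `docs` is any iterable of dicts matching the schema; each dict is
    assumed to hold a stable record identifier under the "id" key.
    """
    for doc in docs:
        yield {
            "_op_type": "index",           # index = create or overwrite
            "_index": index,
            "_id": doc["id"],              # stable IDs make reloads idempotent
            "_source": {k: v for k, v in doc.items() if k != "id"},
        }

# Usage (requires the elasticsearch-py client and a running cluster):
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch("http://localhost:9200")
# helpers.bulk(es, bulk_actions(records), chunk_size=5000)
```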
Performance Requirements:
Handle up to 500 concurrent queries during peak times.
Desired query latency: under 1 second.
Hardware:
No fixed hardware constraints; flexible on provisioning nodes based on recommendations.
Given this, I’d like guidance on the following Elasticsearch configurations:
Number of shards for optimal performance.
Number of nodes (servers) required for scalability.
JVM heap size per node.
Node capacities in terms of RAM and CPU cores.
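For context on the shard question, my own back-of-envelope estimate follows the commonly cited guideline of keeping shards roughly 10-50 GB each; the 40 GB target below is an assumption, not a measured figure, so please correct me if this reasoning is off:

```python
import math

total_gb = 400          # estimated total index size on disk
target_shard_gb = 40    # assumed target within the ~10-50 GB rule of thumb

primary_shards = math.ceil(total_gb / target_shard_gb)
print(primary_shards)   # -> 10 primary shards at ~40 GB each
```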
I’d appreciate any insights or best practices based on similar use cases. Thank you in advance for your help!