Need advice on building a 4-billion-document index with Elasticsearch

Hi all, I need to estimate the storage and compute requirements for deploying an Elasticsearch solution capable of handling a fairly large document collection.

Here are some of the key requirements:

  • the repository should index 4 billion documents
  • the estimated average document size is 15 KB per document in TXT format, or approximately 60 TB of data overall (see the quick calculation after this list)
  • each document will have an associated set of metadata (document date, title, category, ID, etc.), adding approximately another 2 KB per document
  • every day, roughly 2 million new documents will be added to the collection and indexed, and a similar number will be deleted
  • the average query volume should be in the range of 30,000-50,000 queries per day
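
For clarity, here is a back-of-envelope calculation of the raw data volume implied by these estimates (a rough sketch only: the figures are the ones listed above, and the actual on-disk index size will depend on mappings, analyzers, replicas and compression):

```python
# Back-of-envelope raw data volume from the estimates above.
DOC_COUNT = 4_000_000_000      # total documents to index
AVG_DOC_KB = 15                # average document body size (KB, TXT)
AVG_META_KB = 2                # average metadata size per document (KB)
DAILY_CHURN = 2_000_000        # documents added (and deleted) per day

body_tb  = DOC_COUNT * AVG_DOC_KB / 1e9                    # ~60 TB of document text
total_tb = DOC_COUNT * (AVG_DOC_KB + AVG_META_KB) / 1e9    # ~68 TB including metadata
daily_gb = DAILY_CHURN * (AVG_DOC_KB + AVG_META_KB) / 1e6  # ~34 GB of new source data per day

print(f"Body text: ~{body_tb:.0f} TB, with metadata: ~{total_tb:.0f} TB, "
      f"daily ingest: ~{daily_gb:.0f} GB")
```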

In such a scenario:

  1. What would the overall storage requirements be (indices, temporary areas for indexing, caches, etc.)?
  2. What would be a recommended configuration for the servers dedicated to indexing (number of servers, recommended hardware sizing, etc.)?
  3. What would be a recommended configuration for the servers dedicated to searching (number of servers, recommended hardware sizing, etc.)?

Any suggestions will be appreciated, thanks.

Hello... bumping for interest.
Is there anyone with experience indexing very large document collections who can offer some suggestions?

The amount of space the data takes up can vary quite a lot depending on how you need to query it and what mappings you use. The types of queries and the acceptable latencies can also have a significant impact. I would therefore recommend running some benchmarks with a representative subset of your data to find out. Have a look at this Elastic{ON} talk for some guidance.
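
To make the benchmarking suggestion concrete, here is a minimal sketch of one way to measure on-disk size per document, assuming a test cluster at http://localhost:9200, the official elasticsearch-py client, and made-up index/field names and sample data; swap in a real sample of your documents and your intended mappings to get meaningful numbers:

```python
"""Rough sketch of a storage benchmark: index a representative sample and
extrapolate the on-disk size per document. The cluster URL, index name,
field names and generated sample documents are illustrative assumptions."""
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
INDEX = "sizing-test"          # throwaway index used only for the benchmark
SAMPLE_SIZE = 100_000          # large enough to smooth out per-segment overhead

def sample_docs():
    """Replace with a loader that yields a real sample of the 15 KB TXT documents."""
    for i in range(SAMPLE_SIZE):
        yield {
            "_index": INDEX,
            "_source": {
                "title": f"Document {i}",
                "category": "sample",
                "doc_date": "2024-01-01",
                "body": "lorem ipsum dolor sit amet " * 550,  # roughly 15 KB of text
            },
        }

# Bulk-index the sample, then settle the index so the measured size is stable.
helpers.bulk(es, sample_docs())
es.indices.refresh(index=INDEX)
es.indices.forcemerge(index=INDEX, max_num_segments=1)

# Read the primary store size and extrapolate to the full collection.
stats = es.indices.stats(index=INDEX)
size_bytes = stats["indices"][INDEX]["primaries"]["store"]["size_in_bytes"]
per_doc_kb = size_bytes / SAMPLE_SIZE / 1024
print(f"~{per_doc_kb:.1f} KB on disk per document (primaries only)")
print(f"Extrapolated for 4 billion documents: ~{per_doc_kb * 4e9 / 1e9:.0f} TB before replicas")
```

Running this against a few different mapping choices (e.g. with and without storing the raw body, different analyzers) should give you a KB-per-document figure to extrapolate from, before adding headroom for replicas, merges, translog and caches.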
