Recommended cluster for crawling 150 million documents

Hi Team,

I have to crawl 150 million documents from a SQL Server database into Elasticsearch using the Apache ManifoldCF JDBC connector.
The average document size is 500 KB.
I was thinking of using a 5-node Elasticsearch cluster as below:
All nodes have 8 GB RAM and a 4-core CPU, and I can create only 1 index.
1 coordinating node
3 master & data nodes
1 dedicated data node

  1. Could you please tell me how I should allocate the shards, and whether I can reduce the number of nodes to cut hardware costs in this case?
  2. What would be the approximate disk space required to index 150 million documents?
  3. How can I index such a large volume of data from multiple tables in the SQL Server database into Elasticsearch in as little time as possible?

How did you arrive at this cluster size? Is that 500kB of text per document?

  1. I need failover support for Elasticsearch.
  2. Since the recommended number of master-eligible nodes is 3, I decided to go with a 5-node cluster, as I will also need a client (coordinating) node.

Yes, 500 KB is the actual content size of one document, and we can assume that the content varies considerably across the 150 million documents.

150 million documents of 500 kB each gives a raw total of roughly 70-75 TB, if I have calculated correctly. You then need to store the data twice in order to get high availability. I have no idea how much space that will take up on disk, as that will depend on your data as well as your mappings. I do, however, think it is fair to say that you will need a significantly larger cluster than the one you described.

I would recommend indexing a subset of the data into an index with one or a few shards to see how the raw data volume translates into index size on disk. The stats APIs will also give you an idea of how much heap your data set requires. Based on that you should be able to better estimate the size of the cluster you need.
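
As a concrete sketch of that test, assuming the elasticsearch-py client; the host, index name and sample-document generator below are placeholders, not something from your setup:

```python
from elasticsearch import Elasticsearch, helpers

# Placeholder host; point this at your cluster.
es = Elasticsearch(["http://localhost:9200"])

# A single-shard test index, so on-disk size maps directly to the sample volume.
es.indices.create(
    index="sizing-test",
    body={"settings": {"number_of_shards": 1, "number_of_replicas": 0}},
)

# Index a representative subset; replace this generator with real documents.
def sample_docs():
    for i in range(100000):
        yield {"_index": "sizing-test", "_source": {"id": i, "content": "representative text"}}

helpers.bulk(es, sample_docs())
es.indices.refresh(index="sizing-test")

# Compare raw input size against the index size reported by the stats API.
stats = es.indices.stats(index="sizing-test")
total = stats["indices"]["sizing-test"]["total"]
print("store size (bytes):", total["store"]["size_in_bytes"])
# Segment memory gives a rough feel for heap usage; the exact field available
# depends on the Elasticsearch version.
print("segment memory (bytes):", total["segments"].get("memory_in_bytes"))
```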

  1. Set the number of primary shards per index equal to the number of data nodes, and enable best_compression to reduce disk usage, as described in the Elasticsearch reference (see the index-settings sketch after this list).

  2. In our log-storing case (common web app logs), it takes about 300 GB for 500 million documents with 1 primary and 1 replica, each doc being about 10~20 KB.

  3. Use the bulk API (also sketched below), but for ~70 TB of raw data I believe it will be a job that takes days.
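
A rough sketch of points 1 and 3, assuming the elasticsearch-py client plus pyodbc for reading SQL Server; the host, index name, connection string and table are all placeholders (in the real pipeline the ManifoldCF JDBC connector would do the reading):

```python
from elasticsearch import Elasticsearch, helpers
import pyodbc

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# One primary shard per data node; best_compression trades some CPU for disk.
es.indices.create(
    index="documents",  # placeholder index name
    body={
        "settings": {
            "number_of_shards": 5,        # = number of data nodes
            "number_of_replicas": 1,
            "index.codec": "best_compression",
        }
    },
)

# Pull rows from SQL Server (pyodbc used here just for illustration).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlhost;DATABASE=docs;UID=user;PWD=secret"  # placeholders
)
cursor = conn.cursor()
cursor.execute("SELECT id, title, body FROM documents")  # placeholder table/columns

def actions():
    while True:
        rows = cursor.fetchmany(1000)
        if not rows:
            break
        for row in rows:
            yield {
                "_index": "documents",
                "_id": row.id,
                "_source": {"title": row.title, "body": row.body},
            }

# helpers.bulk batches the requests; tune chunk_size to your document size.
helpers.bulk(es, actions(), chunk_size=500)
```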

By the way, 500 KB per doc is quite a huge size; it must be some long paper, I guess, haha.

Hope it helps.

Hi Team,

With 5 data nodes in my cluster, I am thinking of having 4 primary shards and 3 replicas per primary for each index. I have 18 indexes in my cluster, so each index has 4 x (1 + 3) = 16 shard copies, giving 288 shards in total, or roughly 58 shards per node. Will this work fine for a total of 1 TB of data or more, with 64 GB RAM and 1 TB of storage on each node?
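
For reference, this is the per-index settings call I have in mind (host and index name are placeholders), with the shard arithmetic in the comments:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# Per index: 4 primaries, each with 3 replicas, i.e. 4 x 4 = 16 shard copies.
# Across 18 indexes that is 288 shards, or roughly 58 per node on 5 data nodes.
es.indices.create(
    index="index-01",  # placeholder name; repeated for each of the 18 indexes
    body={
        "settings": {
            "number_of_shards": 4,
            "number_of_replicas": 3,
        }
    },
)
```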

I want to use the snapshot and restore API.
My doubts are as follows:

  1. If I take a snapshot of my cluster (containing one or more indexes) every 15 minutes, do I need to restore all of the snapshots taken, or just the last snapshot, when restoring?
  2. How can I schedule the snapshots? (A sketch of the calls I have in mind is below.)
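
For context, here is a minimal sketch of the snapshot and restore calls I have in mind, using the elasticsearch-py client; the repository name, path and host are placeholders, and the 15-minute schedule would come from an external scheduler such as cron:

```python
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# Register a shared filesystem repository once; the path is a placeholder and
# must be listed under path.repo on every node.
es.snapshot.create_repository(
    repository="my_backup",
    body={"type": "fs", "settings": {"location": "/mnt/es_backups"}},
)

# Take a snapshot; this call would run every 15 minutes from the scheduler.
snapshot_name = "snap-" + datetime.utcnow().strftime("%Y%m%d-%H%M%S")
es.snapshot.create(
    repository="my_backup",
    snapshot=snapshot_name,
    body={"indices": "*", "include_global_state": False},
    wait_for_completion=False,
)

# Restore a chosen snapshot later; the indexes being restored must be closed
# or not present in the cluster.
es.snapshot.restore(
    repository="my_backup",
    snapshot=snapshot_name,
    body={"indices": "*"},
)
```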
