Recommended cluster for crawling 150 million documents

Hi Team,

I have to crawl 150 million documents from a SQL Server database into Elasticsearch using the Apache ManifoldCF JDBC connector.
The average document size is 500 KB.
I was thinking of using a 5-node Elasticsearch cluster as below:
All nodes have 8 GB RAM and a 4-core CPU, and I can create only 1 index.
1 coordinating node
3 master & data nodes
1 dedicated data node

  1. Could you please tell me how I should allocate the shards, and whether I can reduce the number of nodes to cut hardware costs in this case?
  2. What would be the approximate disk space required to index 150 million documents?
  3. How can I index such a large volume of data from multiple tables in the SQL Server database into Elasticsearch in as little time as possible?

How did you arrive at this cluster size? Is that 500kB of text per document?

  1. I need failover support for Elasticsearch.
  2. Since the recommended number of master-eligible nodes is 3, I decided to go with a 5-node cluster, as I will also need a client (coordinating) node.

Yes, 500 KB is the actual content size of one document, and we can assume that the content varies considerably across the 150 million documents.

150 million documents of 500 kB each gives a raw total of roughly 70-75 TB, if I have calculated correctly. You then need to store the data twice in order to get high availability. I have no idea how much space that will take up on disk, as that will depend on your data as well as your mappings. I do, however, think it is fair to say that you will need a significantly larger cluster than the one you described.

I would recommend indexing a subset of the data into an index with one or a few shards to see how the raw data volume translates into index size on disk. The stats APIs will also give you an idea of how much heap your data set requires. Based on that you should be able to better estimate the size of the cluster you need.
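
As a concrete sketch of that test, assuming the elasticsearch-py client; the host, index name and sample-document generator below are placeholders, not something from your setup:

```python
from elasticsearch import Elasticsearch, helpers

# Placeholder host; point this at your cluster.
es = Elasticsearch(["http://localhost:9200"])

# A single-shard test index, so on-disk size maps directly to the sample volume.
es.indices.create(
    index="sizing-test",
    body={"settings": {"number_of_shards": 1, "number_of_replicas": 0}},
)

# Index a representative subset; replace this generator with real documents.
def sample_docs():
    for i in range(100000):
        yield {"_index": "sizing-test", "_source": {"id": i, "content": "representative text"}}

helpers.bulk(es, sample_docs())
es.indices.refresh(index="sizing-test")

# Compare raw input size against the index size reported by the stats API.
stats = es.indices.stats(index="sizing-test")
total = stats["indices"]["sizing-test"]["total"]
print("store size (bytes):", total["store"]["size_in_bytes"])
# Segment memory gives a rough feel for heap usage; the exact field available
# depends on the Elasticsearch version.
print("segment memory (bytes):", total["segments"].get("memory_in_bytes"))
```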

  1. Set the number of primary shards per index equal to the number of data nodes, and enable best_compression to reduce disk usage, as described in the Elasticsearch reference (see the index-settings sketch after this list).

  2. In our log-storing case (common web app logs), it takes about 300 GB for 500 million documents with 1 primary and 1 replica, each doc being about 10~20 KB.

  3. Use the bulk API (also sketched below), but for ~70 TB of raw data I believe it will be a job that takes days.
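
A rough sketch of points 1 and 3, assuming the elasticsearch-py client plus pyodbc for reading SQL Server; the host, index name, connection string and table are all placeholders (in the real pipeline the ManifoldCF JDBC connector would do the reading):

```python
from elasticsearch import Elasticsearch, helpers
import pyodbc

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# One primary shard per data node; best_compression trades some CPU for disk.
es.indices.create(
    index="documents",  # placeholder index name
    body={
        "settings": {
            "number_of_shards": 5,        # = number of data nodes
            "number_of_replicas": 1,
            "index.codec": "best_compression",
        }
    },
)

# Pull rows from SQL Server (pyodbc used here just for illustration).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlhost;DATABASE=docs;UID=user;PWD=secret"  # placeholders
)
cursor = conn.cursor()
cursor.execute("SELECT id, title, body FROM documents")  # placeholder table/columns

def actions():
    while True:
        rows = cursor.fetchmany(1000)
        if not rows:
            break
        for row in rows:
            yield {
                "_index": "documents",
                "_id": row.id,
                "_source": {"title": row.title, "body": row.body},
            }

# helpers.bulk batches the requests; tune chunk_size to your document size.
helpers.bulk(es, actions(), chunk_size=500)
```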

By the way, 500 KB per doc is quite a huge size; it must be some long paper, I guess, haha.

Hope it helps.

Hi Team,

With 5 data nodes in my cluster, I am thinking of having 4 primary shards and 3 replicas per primary for each index. I have 18 indexes in my cluster, so each index has 4 x (1 + 3) = 16 shard copies, giving 288 shards in total, or roughly 58 shards per node. Will this work fine for a total of 1 TB of data or more, with 64 GB RAM and 1 TB of storage on each node?
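
For reference, this is the per-index settings call I have in mind (host and index name are placeholders), with the shard arithmetic in the comments:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# Per index: 4 primaries, each with 3 replicas, i.e. 4 x 4 = 16 shard copies.
# Across 18 indexes that is 288 shards, or roughly 58 per node on 5 data nodes.
es.indices.create(
    index="index-01",  # placeholder name; repeated for each of the 18 indexes
    body={
        "settings": {
            "number_of_shards": 4,
            "number_of_replicas": 3,
        }
    },
)
```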

I want to use the snapshot and restore API.
My doubts are as follows:

  1. If I take a snapshot of my cluster (containing one or more indexes) every 15 minutes, do I need to restore all of the snapshots taken, or just the last snapshot, when restoring?
  2. How can I schedule the snapshots? (A sketch of the calls I have in mind is below.)
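
For context, here is a minimal sketch of the snapshot and restore calls I have in mind, using the elasticsearch-py client; the repository name, path and host are placeholders, and the 15-minute schedule would come from an external scheduler such as cron:

```python
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# Register a shared filesystem repository once; the path is a placeholder and
# must be listed under path.repo on every node.
es.snapshot.create_repository(
    repository="my_backup",
    body={"type": "fs", "settings": {"location": "/mnt/es_backups"}},
)

# Take a snapshot; this call would run every 15 minutes from the scheduler.
snapshot_name = "snap-" + datetime.utcnow().strftime("%Y%m%d-%H%M%S")
es.snapshot.create(
    repository="my_backup",
    snapshot=snapshot_name,
    body={"indices": "*", "include_global_state": False},
    wait_for_completion=False,
)

# Restore a chosen snapshot later; the indexes being restored must be closed
# or not present in the cluster.
es.snapshot.restore(
    repository="my_backup",
    snapshot=snapshot_name,
    body={"indices": "*"},
)
```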
