I have 150,000 fathers, and each father has 20 million children.
There is only one type of document; each document describes a child and its father.
I routed the documents by father id, but I am not sure it's a good idea: no matter how many nodes I scale out to, each node will hold at least dozens of millions of documents, because I have so many fathers.
If my only goal is low latency and high search speed, is this the right way to organize my data?
I thought about giving each family its own index, but someone told me it's a bad idea to create so many indices (150,000) in one cluster.
What is the right choice?
Each document has 40 fields
About 1 GB of index per 5 million documents
Dozens of TB of new data every year
Maximum 30 concurrent searches
Target latency must be less than 20 seconds
Filter queries only
Only 2 of the 40 fields are text fields
Search speed is the only thing that matters
A search can target one father's children or all the children of all the fathers
Thanks for answering
Liron.
Given the size of your data, it is important that you go through this guide and really optimize your mappings. Make sure you use exactly the mappings you need in order to search (not necessarily the default mappings), do not index fields you do not need to search on, and enable best_compression. Once you have done this, index perhaps 20 million documents and check how much space that takes up on disk.
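To make the mapping advice concrete, here is a minimal sketch of a size-optimized index definition, expressed as the Python dict you would send as the create-index body. The field names (`father_id`, `child_name`, `notes`) are hypothetical stand-ins for your own 40 fields:

```python
# Sketch of an index body with explicit, minimal mappings.
# - best_compression trades a little CPU for smaller disk usage
# - exact-match filter fields are keyword, not the default text+keyword multi-field
# - fields that are stored but never searched or aggregated on disable
#   indexing and doc_values entirely
index_body = {
    "settings": {
        "index": {
            "codec": "best_compression",
        }
    },
    "mappings": {
        "properties": {
            # filtered on exactly, so keyword only (hypothetical field)
            "father_id": {"type": "keyword"},
            # one of the two genuinely full-text fields (hypothetical field)
            "child_name": {"type": "text"},
            # retrieved but never searched on (hypothetical field)
            "notes": {"type": "keyword", "index": False, "doc_values": False},
        }
    },
}
```

With only 2 text fields out of 40 and filter-only queries, most fields can be mapped as plain `keyword` (or have indexing disabled), which is where the disk savings come from.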
When it comes to indexing and sharding, it is important to know that querying one index with 1000 shards is basically the same as querying 1000 indices with one shard each. Since some of your queries will have to target all your data, those will be the slowest and most resource-demanding ones. Because you also need to support concurrent queries, you generally want your shards to be as large as possible. Having one father per index would give you a shard size of less than 4 GB (20 million children at roughly 1 GB per 5 million documents), which is quite small. You are therefore probably better off with a single index and a reasonably large number of primary shards. When indexing, use routing, as this allows queries targeting a single father to hit only one shard, making them as efficient as possible and minimizing resource usage.
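To illustrate why routing helps, here is a simplified sketch of how a routing value maps a document to a shard. Elasticsearch actually uses murmur3 for this; md5 stands in here for determinism, and the father ids are hypothetical:

```python
import hashlib

def shard_for(routing_value: str, num_primary_shards: int) -> int:
    # Simplified stand-in for Elasticsearch's routing formula,
    # which is essentially: shard = hash(_routing) % number_of_primary_shards.
    digest = hashlib.md5(routing_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

# All children of the same father share one routing value (the father id),
# so they all land on the same shard, and a query routed by father id
# touches that single shard instead of fanning out to every shard.
shard_a = shard_for("father-42", 7200)
shard_b = shard_for("father-42", 7200)
```

A query for all fathers still fans out to every shard; routing only speeds up the single-father case, which is exactly the trade-off described above.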
If we assume improved mappings reduce the index size by 40%, each father will on average take up 2.4 GB on disk (primary shards). This gives a total primary size of 360 TB, which is a lot. With a shard size of 50 GB, that means 7200 primary shards. If we add a replica shard for resiliency, that is 720 TB of total storage and 14400 shards. If we assume a single node can serve 4 TB of data while maintaining query latency and throughput, that gives you a cluster of 180 data nodes.
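The back-of-the-envelope math above can be written out so the inputs are easy to change. All of the constants (2.4 GB per father after mapping savings, 50 GB shards, 4 TB per node) are the assumptions stated in the paragraph, not measurements:

```python
# Capacity estimate from assumed inputs; 1 TB = 1000 GB throughout.
FATHERS = 150_000
GB_PER_FATHER = 2.4       # assumed average on-disk size after ~40% mapping savings
SHARD_SIZE_GB = 50        # assumed target shard size
NODE_CAPACITY_TB = 4      # assumed data volume one node can serve
REPLICAS = 1              # one replica per primary for resiliency

primary_tb = FATHERS * GB_PER_FATHER / 1000            # total primary data
primary_shards = round(primary_tb * 1000 / SHARD_SIZE_GB)
total_tb = primary_tb * (1 + REPLICAS)                 # primaries + replicas
total_shards = primary_shards * (1 + REPLICAS)
data_nodes = round(total_tb / NODE_CAPACITY_TB)

print(primary_tb, primary_shards, total_tb, total_shards, data_nodes)
# → 360.0 7200 720.0 14400 180
```

Halving the per-father size, for example, halves the node count, which is why benchmarking the real data first matters so much.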
These are just assumptions; the total data size, as well as how much data a node can hold and handle, will depend on your data, queries and hardware. You need to benchmark to get an accurate estimate, as a small change can make a big difference. For this I recommend the following resources: