What are suitable AWS instance types and other recommendations for heavy search and aggregation? I will have ~10 million documents in a single index, with a total size of around 70 TB. I am planning to use a cluster with 45 data nodes, 4 coordinating nodes and 3 master nodes. My high-level use cases are:
Parent-child relationships are implemented.
Heavily nested objects within a single document.
High-cardinality fields and heavy group-by operations.
Deep pagination.
n-grams are implemented.
5-10 users run queries in parallel.
A single document has 600+ fields and is ~40 KB in size.
Note that all of these add overhead and, as far as I know, generally go against best practices for optimal performance.
Having said that it seems like you have a large data set that will not fit in the page cache so I suspect disk I/O might become the bottleneck. For this reason I would recommend running a test/benchmark on i8g.2xlarge / i4i.2xlarge (or similar) instances.
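To make the page cache point concrete, here is a rough back-of-envelope sketch. The 70 TB and 45 data nodes come from the question; the 64 GiB of RAM per node and the 50% share given to the JVM heap are assumptions for illustration only, so substitute the real figures for whatever instance type you benchmark. It also assumes the 70 TB figure is the total on-disk size, which the question does not state.

```python
# Back-of-envelope estimate: what fraction of each node's data could the
# OS page cache hold? All inputs other than 70 TB / 45 nodes are assumptions.
TB = 1000 ** 4   # decimal terabyte
GIB = 1024 ** 3  # binary gibibyte


def page_cache_coverage(index_bytes, data_nodes, ram_per_node_bytes, heap_fraction=0.5):
    """Return (data per node, page cache per node, cache/data ratio)."""
    data_per_node = index_bytes / data_nodes
    # RAM not given to the JVM heap is what the OS can use for the page cache.
    cache_per_node = ram_per_node_bytes * (1 - heap_fraction)
    return data_per_node, cache_per_node, cache_per_node / data_per_node


data_per_node, cache_per_node, ratio = page_cache_coverage(
    index_bytes=70 * TB,          # from the question
    data_nodes=45,                # from the question
    ram_per_node_bytes=64 * GIB,  # assumed RAM per data node
)

print(f"data per node:  {data_per_node / TB:.2f} TB")
print(f"cache per node: {cache_per_node / GIB:.0f} GiB")
print(f"cache coverage: {ratio:.1%}")
```

Under these assumptions each node holds about 1.5 TB and can cache only a few percent of it, which is why most reads would likely go to disk and fast local storage matters.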
Thanks for the response. My follow-up question: I noticed that when we run heavy search and aggregation queries, most of the time is consumed by the coordinating node aggregating the results from the shards. Is there any config, suggestion, or other way to reduce the time spent on the coordinating node?
Not really. The amount of work it needs to do depends on your queries, data and mappings, and as I pointed out you are using a lot of "expensive" features and patterns. I am not sure what instance type is best for coordinating-only nodes for your use case, as I have not worked with any use case similar to yours and try to avoid the patterns/features you are using whenever I can.
If you want to make it more efficient, I believe you will need to change how you index and query data. I doubt there is any magical silver bullet for this.
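As one possible illustration of changing how you query, and not something tested against this data set, deep pagination with large `from` offsets could be replaced by `search_after`, where each request continues from the sort values of the last hit on the previous page. The sketch below only builds the request bodies and does not call any client. The field names `timestamp` and `doc_id` are hypothetical, and the helper `build_page_body` is made up for this example.

```python
# Minimal sketch: build search_after request bodies instead of using a
# growing "from" offset. Field names and the helper are hypothetical.
def build_page_body(query, sort, page_size, last_sort_values=None):
    """Build a search request body for one page of results."""
    body = {"size": page_size, "query": query, "sort": sort}
    if last_sort_values is not None:
        # Continue after the last hit of the previous page.
        body["search_after"] = list(last_sort_values)
    return body


# The sort needs a unique tiebreaker so that pages do not overlap.
sort = [{"timestamp": "desc"}, {"doc_id": "asc"}]  # hypothetical fields
query = {"match_all": {}}

first_page = build_page_body(query, sort, page_size=100)

# In a real loop these values would come from the "sort" array of the last
# hit in the previous response; the values here are invented.
second_page = build_page_body(
    query, sort, page_size=100, last_sort_values=[1700000000000, "doc-0100"]
)
```

Each shard then only has to return the top `size` hits after the given sort values, rather than `from + size` hits, which should reduce the merging work on the coordinating node. Whether it helps enough for this workload would need to be benchmarked.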