Our cluster will have ~8-10 data nodes (r3,x2 memory-optimized EC2 instances), with ~2000 shards per data node and ~50,000 filtered aliases. What is the recommended configuration for master nodes?
Since we are going to have a large number of shards per data node and a lot of filtered aliases, does it make sense to use IO-optimized master nodes in EC2?
In addition, are there any issues with using this many filtered aliases in a cluster?
Our use cases are such that we can't combine data from different sources in one index if a data source generates a significant amount of data. If we combine them, we risk making individual shards very large.
Sources that generate very little data are combined into one index, and we separate them with filtered aliases.
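To make the setup concrete, here is a minimal sketch of how one filtered alias per small source could be defined on a shared index. It only builds the request body for the `_aliases` API and prints it; nothing is sent to a cluster. The index name `small-sources`, the field name `source_id`, the alias naming scheme, and the helper function are all made up for illustration and are not our real names.

```python
import json


def build_filtered_alias_actions(index, source_ids, field="source_id"):
    """Build an _aliases request body with one filtered alias per source.

    The index name, field name, and alias naming scheme are hypothetical.
    """
    actions = []
    for source_id in source_ids:
        actions.append({
            "add": {
                "index": index,
                "alias": f"source-{source_id}",
                # Each alias only exposes documents from its own source.
                "filter": {"term": {field: source_id}},
            }
        })
    return {"actions": actions}


# Two small sources sharing one index (illustrative IDs).
body = build_filtered_alias_actions("small-sources", ["a17", "b42"])
print(json.dumps(body, indent=2))
```

Every alias defined this way becomes part of the cluster state, which is why ~50,000 of them raises the master node question.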
Based on the above, we expect to have ~2000 shards and ~50,000 filtered aliases.
What master node configuration is recommended in such a case? Or, put differently, how much load will the master node have to bear with this setup?
~100GB and still growing. However, if we combine more sources into a single index/shards, we will need more filtered aliases. That is why we are weighing the number of shards against the number of filtered aliases.