We have an index that is growing by around 500 TB per week.
Currently the index is about 2 TB in size with 3 replicas, and indexing a 750 MB document takes around 20-30 minutes. A lot of files to upload have piled up, and we are unable to catch up.
We have a 10-node cluster (Windows Azure VMs) with 4 data nodes, 3 master nodes, and 3 client nodes. The data nodes have 56 GB RAM and 8 cores each.
What we really want to find out is: would daily, weekly, or monthly indices be a better option than a single huge index?
If we use smaller indices, will maintaining them become an issue over the long term? If so, what sort of challenges can we expect?
@Christian_Dahlqvist Our indices will be write-heavy (around 5-8 GB per day for each index) and read-heavy too, and the documents are immutable.
We will load the data once a day, but currently indexing runs all the time because it is so slow. It is a platform that will be used by at least 100 members, though they may not all be concurrent.
I answered your other question and think you will benefit from switching to time-based indices. This allows a smaller set of indices to be targeted if you are only looking at data within a limited time frame.
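As a minimal sketch of what that could look like (the host, index names, and document type below are placeholders, not anything from your setup): write each day's documents into a date-stamped index, then let queries over a limited time frame target only the matching indices.

```
# Index each day's documents into a date-stamped index instead of one huge index
curl -XPOST 'http://localhost:9200/docs-2016.07.21/doc' -d '{
  "title": "example document"
}'

# A query over a limited time frame then only touches the matching indices
curl -XGET 'http://localhost:9200/docs-2016.07.*/_search' -d '{
  "query": { "match_all": {} }
}'
```

Queries that need to span all time can still hit `docs-*` (or an alias covering the indices), so the application does not need to know individual index names.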
The ideal time period an index should cover varies by use case. Adjust the number of primary shards based on the number of nodes in the cluster (to spread the data out) as well as the volume indexed per day. Make sure you do not end up with shards that are too small or too large. Having a large number of very small shards is inefficient, as each shard has some overhead, while shards that are too large can affect query performance as well as recovery.
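To make that concrete with your numbers: at 5-8 GB per day, daily indices with 4 primary shards would give shards of only about 1-2 GB each, which is on the small side, whereas weekly indices (roughly 35-56 GB, so about 9-14 GB per shard) may be a better fit. A sketch of an index template that applies shard settings to every new time-based index (the template name, pattern, and shard/replica counts are assumptions to tune for your cluster; newer Elasticsearch versions use `index_patterns` instead of `template`):

```
# Hypothetical template: applied automatically to every new index matching docs-*.
# 4 primary shards to spread data across the 4 data nodes; adjust replicas to
# your own redundancy requirement.
curl -XPUT 'http://localhost:9200/_template/docs_template' -d '{
  "template": "docs-*",
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}'
```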