So I tested my data on a single shard and it seems to perform adequately. Would a single shard affect parallelism, i.e. running operations such as bulk indexing and search at the same time?
I have 4 physical machines with 32 cores each, ES_HEAP = 30 GB each, and 5 TB of SSD each.
Yeah, just wondering, because right now I'm doing an index per day with 4 shards + 1 replica, which equals 1460 shards per node if I'm right? And that's taking up lots of RAM.
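To sanity-check the shard count, here is a rough sketch of the arithmetic in Python. It assumes shards are spread evenly over the 4 nodes and that every index in the retention window stays open; the function name is made up for illustration. Note that under these assumptions the 1460 figure would correspond to about two years of daily indices (730 x 8 / 4), while the full 3 years comes out higher.

```python
def shards_per_node(index_count, primaries, replicas, nodes):
    """Average shards per node, assuming an even spread across nodes."""
    total_shards = index_count * primaries * (1 + replicas)
    return total_shards / nodes

# Daily indices for 3 years, 4 primaries + 1 replica, 4 nodes
print(shards_per_node(3 * 365, 4, 1, 4))   # 2190.0
# Daily indices for 3 years, 1 primary + 1 replica
print(shards_per_node(3 * 365, 1, 1, 4))   # 547.5
# Monthly indices for 3 years, 4 primaries + 1 replica
print(shards_per_node(3 * 12, 4, 1, 4))    # 72.0
```

So going monthly cuts the shard count by a much bigger factor than dropping from 4 primaries to 1 on daily indices.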
Wondering if I should reduce it to 1 shard + 1 replica. The indices will be spread throughout the cluster, so that will help utilise those cores too, no?
And I have a data retention policy of 3 years. So I'm thinking of maybe going to a monthly index, but how many shards is the big question, because it will take me forever to fill a month's worth of data.
Isssh lol, right now a monthly index with 4 shards + 1 replica is almost 250GB per shard hehe
I'm also trying 8 shards per month, but it's not full yet, so I'll check. But I assume it will be half of the above.
3 billion documents / 8 million docs per day = 375 daily indices (so 375 shards at 1 shard each)
Average document size is about 4k-8k depending on document type.
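For a rough idea of shard size before a month fills up, the numbers above can be plugged into a quick estimate. This is only a sketch: it assumes 8 million docs per day, a 30-day month, and that the on-disk size is about the same as the raw document size, which will not hold exactly once mappings, compression and merges come into play. The function name is hypothetical.

```python
def est_shard_size_gb(docs_per_day, doc_bytes, days, primaries):
    """Raw data volume per primary shard in GB (1 GB = 1e9 bytes)."""
    return docs_per_day * doc_bytes * days / primaries / 1e9

# 4 KB average documents, monthly index, 4 primaries
print(est_shard_size_gb(8_000_000, 4096, 30, 4))   # about 245.76
# Same data with 8 primaries: half the size per shard
print(est_shard_size_gb(8_000_000, 4096, 30, 8))   # about 122.88
# 8 KB average documents, 4 primaries
print(est_shard_size_gb(8_000_000, 8192, 30, 4))   # about 491.52
```

The 4 KB case lands close to the roughly 250GB per shard seen with 4 shards per month, so halving it with 8 shards looks like a fair guess.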
I need to find the right balance between shards, indices and RAM, but it takes forever to index that many documents.
You have a retention of 3 years, but do you keep running search queries against the older indices? If the answer is no, you can close them; a closed index only consumes disk space, and you can open and close an index really easily! Personally I keep 8 weeks open, and if the customer needs old information (scroll) I open the index they need... https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-open-close.html
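A minimal sketch of that "keep 8 weeks open" idea in Python, using only the standard library. The index names, the base URL and both helper functions are assumptions for illustration; the only Elasticsearch pieces used are the `POST /<index>/_close` and `POST /<index>/_open` endpoints from the page linked above. The requests are only built here, not sent.

```python
from datetime import date, timedelta
import urllib.request

def indices_to_close(index_dates, today, keep_weeks=8):
    """Return names of indices whose date is older than keep_weeks."""
    cutoff = today - timedelta(weeks=keep_weeks)
    return sorted(name for name, day in index_dates.items() if day < cutoff)

def index_request(base_url, index, action):
    """Build (but do not send) a POST to /<index>/_close or /<index>/_open."""
    if action not in ("_close", "_open"):
        raise ValueError("action must be '_close' or '_open'")
    return urllib.request.Request(f"{base_url}/{index}/{action}", method="POST")

# Hypothetical index names mapped to the day they cover
index_dates = {
    "logs-2024.01.01": date(2024, 1, 1),
    "logs-2024.03.01": date(2024, 3, 1),
}
for name in indices_to_close(index_dates, today=date(2024, 3, 15)):
    req = index_request("http://localhost:9200", name, "_close")
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would actually send it
```

Reopening an old index for a customer is the same call with "_open" instead of "_close".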