Single huge index vs. daily or weekly indices: which is better?


(Reddaiah nethi) #1

Hi There,

We have an index which is growing by around 500 TB per week.

Currently the index is 2 TB in size with 3 replicas, and indexing a 750 MB document takes around 20-30 minutes. A lot of files waiting to be uploaded have piled up and we are unable to catch up.

We have a 10-node cluster (Windows Azure VMs) with 4 data, 3 master and 3 client nodes. The data nodes have 56 GB RAM and 8 cores each.

What we really want to find out is whether daily, weekly or monthly indices are a better option than a single huge index.

If we have smaller indices, will maintaining them become an issue over the longer term? If yes, what sort of challenges can we expect?


(Jymit Singh Khondhu) #2

How many primary shards do you have along with these three replicas?
How large would one day's index be? Have you tested indexing/searching at this capacity?


(Christian Dahlqvist) #3

Can you tell us a bit about the use case? Is it read and/or write heavy? Do you update documents or are documents generally immutable?


(Reddaiah nethi) #4

@JKhondhu We have 20 primary shards. One day's index will be around 5 to 8 GB and will gradually increase, by around 10% every week or so.

We have not yet tried day-level indexing. We just want to know the benefits and complications of it before we try.


(Reddaiah nethi) #5

@Christian_Dahlqvist Our index will be write heavy, i.e. around 5-8 GB per day for each index, and read heavy too. The documents are immutable.

We will load the data once a day, but currently indexing is running all the time because it is pretty slow. It is a platform which will be used by at least 100 members, though they may not be concurrent.


(Christian Dahlqvist) #6

I answered your other question and think you will benefit from switching to time-based indices. This allows a smaller set of indices to be targeted if you are only looking at data within a limited time frame.
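
As a rough illustration of targeting only the indices for a limited time frame, here is a minimal Python sketch. The `docs` prefix and the `prefix-YYYY.MM.DD` naming pattern are assumptions made for the example, not something taken from your cluster; the comma-separated form is how several indices can be named in one search request.

```python
from datetime import date, timedelta


def daily_index_names(prefix, start, end):
    """Return one index name per day in [start, end], inclusive.

    The "<prefix>-YYYY.MM.DD" pattern is an assumed naming convention.
    """
    names = []
    day = start
    while day <= end:
        names.append(f"{prefix}-{day:%Y.%m.%d}")
        day += timedelta(days=1)
    return names


# Hypothetical three-day window; only these indices would be searched.
names = daily_index_names("docs", date(2017, 3, 1), date(2017, 3, 3))
target = ",".join(names)
print(target)
```

A query limited to those three days would then touch three small indices instead of every shard of one huge index.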

The ideal time period an index should cover varies by use case. Adjust the number of primary shards based on the number of nodes in the cluster (to spread data out) as well as the volume indexed per day. Make sure you do not end up with shards that are too small or too large. Having a large number of very small shards is inefficient, as each shard has some overhead, and shards that are too large can affect query performance as well as recovery.
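
To make the trade-off concrete, here is a back-of-the-envelope Python sketch of average primary shard size for different index periods. It uses the 8 GB per day upper figure and the 20 primary shards mentioned earlier in this thread; the alternative of 4 primaries (one per data node) and the 30-day month are assumptions for illustration only, and replicas are not counted.

```python
def shard_size_gb(volume_gb_per_day, days_per_index, primary_shards):
    """Average size of one primary shard, in GB, for a time-based index."""
    return volume_gb_per_day * days_per_index / primary_shards


DAILY_VOLUME_GB = 8  # upper end of the 5-8 GB per day quoted in this thread

# Index periods to compare; 30 days per month is a simplifying assumption.
periods = [("daily", 1), ("weekly", 7), ("monthly", 30)]

for label, days in periods:
    # 20 = current primary count; 4 = hypothetical one primary per data node.
    for primaries in (20, 4):
        size = shard_size_gb(DAILY_VOLUME_GB, days, primaries)
        print(f"{label:8s} {primaries:2d} primaries -> {size:6.2f} GB per shard")
```

With these inputs a daily index split over 20 primaries gives shards well under 1 GB each, which is the "many very small shards" situation described above, while fewer primaries or a longer period per index gives larger shards.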

