What's the tradeoff between number of indexes, and number of documents in each index?
We're going to be indexing tweets about events. Some events have just a few tweets (zero, or fewer than ten), whereas some have millions. Most events will tend towards the lower end of the spectrum - it's rare to get an event with more than a couple of hundred thousand tweets.
We tend to do analysis on a per-event basis, and we also envisage (eventually) expiring old events from the system (we can always reload the data if we need it). We expect to have around 700 events per day.
Do you guys have any thoughts on what a sensible index strategy might be? I don't think One Big Index is a good design choice, from an operational point of view - as we're exploring ES I expect we'll occasionally have to reindex things, and I'd like to be able to do that incrementally, event by event (or at least by groups of events - say a day's worth).
However, I've also seen references in the docs to a per-index overhead, though I've not seen that overhead quantified. I therefore suspect that the 700-odd new indexes per day created by an index-per-event strategy would also be unwise.
I'm inclined, therefore, to start out with an index per day. Each index will usually end up with a few hundred thousand documents in it.
Does that sound reasonable? As I said, I'm really trying to understand what the nature of the tradeoff between index size and quantity of indexes is.
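For concreteness, here's roughly the scheme I have in mind, sketched in Python. The index naming pattern and the idea of routing by event ID are my own assumptions, not anything from the docs - routing would keep each event's tweets on a single shard, which I'm hoping makes per-event analysis and eventual expiry cheaper:

```python
from datetime import datetime, timezone

def daily_index(ts: datetime) -> str:
    # One index per UTC day; the "tweets-YYYY.MM.DD" pattern is hypothetical.
    return ts.astimezone(timezone.utc).strftime("tweets-%Y.%m.%d")

def index_request(event_id: str, tweet: dict, ts: datetime) -> dict:
    # Build the metadata for a bulk-index action: the tweet goes into
    # the day's index, routed by event ID so all tweets for one event
    # land on the same shard.
    return {
        "_index": daily_index(ts),
        "_routing": event_id,
        "_source": tweet,
    }

# Example: a tweet from 2 Jan 2014 would be indexed into "tweets-2014.01.02".
req = index_request("event-42", {"text": "hello"}, datetime(2014, 1, 2, tzinfo=timezone.utc))
```

Expiring a day's events would then just be a matter of dropping that day's index.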