We're looking at indexing a flow of social media documents (60+ million per
day), where we're going to store docs for a couple of weeks. Most of the
queries to this setup will be for documents from the last 24 hours, but
some will go further back (for bulk analysis). The time ranges of queries
vary, some only one hour and some longer.
The thing we're worried about is getting hot shards or nodes.
To handle the document flow we're using one index per date, with as many
shards as we have ES nodes (and some replicas). We're also routing
documents based on the hour of day in their timestamp, to limit the search
space for queries.
With this setup, indexing will go to a single shard every hour, which also
happens to be the shard that most queries hit. Having many replicas can
distribute the hits over the nodes, but the shard will still be hot.
We've been thinking of more fine-grained indices (hourly), but this creates
the issue of possibly too many indices (file handles and overhead). The
fine-grained indices are also probably overkill for the 90% of data that is
older than most queries will hit. There is no logical routing below hour of
day, as well - but this might not be a problem since the index is much
smaller. Still, network traffic could be a problem instead.
Another option is to have fine-grained indices for the last few days, then
concatenate indices for the old data - but this seems expensive (it's
reindexing several gigs of documents).
Am I over-thinking the problem of hot shards/nodes?
Also, what could be a good routing strategy that is not time-dependent?