I have the following requirements, which seem a bit atypical:
Indexing up to ~5000 documents per second (small, (semi-)structured docs; ingestion needs to be reliable!)
Exporting / querying large data sets of up to 5*10^6 documents (does not have to be fast)
Persisting data for many years (time-based indices are not an option)
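For context, we batch writes through the bulk API to reach that rate; a minimal sketch of one batch (the index name and fields are made up for illustration):

```shell
# Hypothetical bulk request against a local node: two small semi-structured
# docs indexed into one per-source index in a single round trip.
curl -X POST "localhost:9200/_bulk" -H 'Content-Type: application/x-ndjson' -d'
{ "index": { "_index": "source-a" } }
{ "origin": "sensor-42", "ts": "2023-01-01T00:00:00Z", "value": 17 }
{ "index": { "_index": "source-a" } }
{ "origin": "sensor-42", "ts": "2023-01-01T00:00:01Z", "value": 18 }
'
```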
Currently we have the following strategy:
- single node (vertically scalable)
- no replica
- 1 index per data source / origin (typically 100 MB to 1 GB)
- 1 shard per index
- 1 index schema
- no ILM, but we close indices after a while (writing to a specific index ends after a few weeks; queries must remain possible for months/years) and reopen them on demand
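Concretely, our per-index setup looks roughly like this (index name is a placeholder):

```shell
# One single-shard index per data source, no replica (single node).
curl -X PUT "localhost:9200/source-a" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'

# Once writing to an index has stopped, we close it to free resources...
curl -X POST "localhost:9200/source-a/_close"

# ...and reopen it on demand when a query for that old data comes in.
curl -X POST "localhost:9200/source-a/_open"
```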
With this approach we eventually end up with hundreds of indices, most of them closed (no more indexing) and up to about 100 of them open. This also means hundreds of shards on a single node.
Currently queries focus on a single index / a subset of one index's data, but we will need support for queries over multiple indices (which is possible using the search API).
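As far as I understand, multi-index queries would look something like this (index names and the `origin` field are made up):

```shell
# Query several indices in one request via a comma-separated list.
curl -X GET "localhost:9200/source-a,source-b/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "term": { "origin": "sensor-42" } }
}'

# Wildcard patterns work too, but by default they only match open
# indices, so our closed indices would not be searched.
curl -X GET "localhost:9200/source-*/_search"
```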
I am wondering if there might be a better strategy overall, as I know that indices should be organized by data structure rather than by data origin, and that we shouldn't have too many shards on a single node. Also, shards should supposedly be in the 10-50 GB range, not ~1 GB.
On the other hand, I don't think creating a huge (!) single index makes sense when we only ever query small subsets of the data. And specifying the number of shards up front basically eliminates vertical scaling, as we cannot simply change the shard count later the way we can change the heap size, right?
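If we did move to fewer, larger indices, I assume our existing small per-origin indices could be merged with the reindex API, keeping the origin as a document field (sketch only, names invented):

```shell
# Copy documents from a small per-origin index into a larger shared one.
# The source index must be open while reindexing runs.
curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "source-a" },
  "dest":   { "index": "consolidated-2023" }
}'
```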
So far the current strategy does not perform too badly. Running a single node with about 500 GB of data still works, but we can expect more query load in the future, and I suppose we should come up with a better strategy.