Hello,
I want to provide different users with different levels of data retention (e.g. user a only pays me to keep 1 month of data, but user b wants to keep 12 months).
In order to make this manageable, I am planning to have one index per month per user. I would have index names like 'user_a_2016_jan', 'user_b_2016_jan', 'user_a_2016_feb', 'user_b_2016_feb', etc.
This way I can just delete old indexes on a per user basis. I plan to implement this as follows.
- Define an index template for the cluster with my mapping.
- Write indexing code that indexes to the appropriate index name and relies on auto-creation to create the index. This way, new customers will have new indexes created for them when they start sending data, without any need for prior setup, and indexes will automatically roll over at the start of a new month.
- Perform any queries for a user by specifying a wildcard index. For example, for user a, I would use index: 'user_a_*'. This way, queries for user a will run across all existing user a indexes. I would normally use an alias for something like this, but there doesn't seem to be a way to have a 'user_a' alias for auto-created indexes.
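To make the plan concrete, here is a rough sketch of the naming, query, and cleanup logic I have in mind. It is plain Python with no Elasticsearch calls; the function names are my own placeholders, and I am assuming that "keep N months" means the current calendar month plus the N-1 months before it.

```python
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def index_name(user: str, day: date) -> str:
    """Index to write to for this user on this day, e.g. 'user_a_2016_jan'."""
    return f"user_{user}_{day.year}_{MONTHS[day.month - 1]}"


def query_pattern(user: str) -> str:
    """Wildcard index pattern covering all of a user's monthly indexes."""
    return f"user_{user}_*"


def expired_indexes(user: str, existing: list[str], today: date,
                    keep_months: int) -> list[str]:
    """Return this user's indexes that fall outside the retention window.

    Assumption: the current month counts as the first retained month.
    """
    # Months are compared as a single integer: year * 12 + month offset.
    cutoff = today.year * 12 + (today.month - 1) - (keep_months - 1)
    expired = []
    for name in existing:
        parts = name.rsplit("_", 2)  # [ 'user_<id>', '<year>', '<mon>' ]
        if len(parts) != 3 or parts[0] != f"user_{user}":
            continue  # belongs to another user or is not a monthly index
        _, year, mon = parts
        if not year.isdigit() or mon not in MONTHS:
            continue
        if int(year) * 12 + MONTHS.index(mon) < cutoff:
            expired.append(name)
    return expired
```

For example, on 10 Feb 2016 with a one-month retention for user a, `expired_indexes` would return only 'user_a_2016_jan', and that is the index I would then delete.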
Given that I am building an analytics service and hence will make HEAVY use of aggregations, I have some questions.
- Is it advisable to use wildcard indexes like this rather than aliases? Am I going to be missing out on some important capability by not having aliases?
- This approach means that the number of shards I have is dictated by the number of users, rather than by explicit scaling decisions. If I have 200 users, then after a year I will have 2400 indexes. If I set each index to have one primary and one replica shard, that means 4800 shards. Is this going to be a huge problem? What if I want to run aggregations against _all too?
- I have a number of fields that have a set of possible values of known cardinality. For example, 'type' might be able to be 'foo', 'bar' or 'xxx'. My understanding is that this limited cardinality means these documents can be stored in the index quite efficiently. Am I going to lose that by having a lot of small indexes? I.e., is it going to require 200 times the space to index the documents across 200 indexes as opposed to one big one?
- Any other gotchas I should be aware of?
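For reference, this is the shard arithmetic behind my second question, as a small sketch. The function name and parameters are just illustrative, and it assumes every user keeps a full 12 months of monthly indexes.

```python
def total_shards(users: int, months: int, primaries: int = 1,
                 replicas: int = 1) -> tuple[int, int]:
    """Return (index count, total shard count) for one index per user per month."""
    indexes = users * months
    # Each primary shard is copied `replicas` times.
    shards_per_index = primaries * (1 + replicas)
    return indexes, indexes * shards_per_index


indexes, shards = total_shards(users=200, months=12)
print(indexes, shards)  # 2400 4800
```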
Thanks
Perryn