Background: Presenting ElasticSearch Data Models
There are many possible data models available on ElasticSearch, but obviously using ElasticSearch as a time series date store is the most popular. Time series data is usually immutable (old data almost never get updated later on), old data is usually deleted at once and the write throughput is usually much higher than the read/search throughput, The granularity of the search is usually the only last few hours or days. That's why people like to use ElasticSearch for logs indexing, and it's done by creating a separate index per fixed time range (each day/week/month has its own index).
Our Problem: Support Multiple Indices Instead of One
Our use in ElasticSearch is quite different. We're not using ElasticSearch as a time series store, but instead as real time adaptive analytics store. My write throughput is basically low, however my read granularity is quite big - which means that in most of the times we want to query all the data in the last month.
For demonstration purposes, let's assume we have a movie actors data store. Each document describe one actor, and the related (nested) documents describe the movies whom that particular actor participated in. The actor details are getting updated once and while, but it's rare. On the other hand, at any time we can attach a new movie to an actor - no matter when this actor created in the first place. We obviously don't delete movies which were attached to an actor in the past, and obviously active actors are getting updated and queried more than less active ones.
Back to ElasticSearch data model, the basic approach here would be creating a large single index. BUT, our goal is to somehow DO use multiple indices per time frames. I do know that searching 20 indices with 1 shard, is equivalent to search 1 index with 20 shards. So the main motivation for this is maintenance: easily scaling out (creating new indices with more shards), easily deleting old data at once, easily re-indexing the data in chunks (each index separately instead of all together), run optimize procedures and so forth.
##Dealing with the Write Process
In our system, we cannot allow (in my opinion) that one actor would be described in 2 different documents. That would lead to unexpected and inaccurate results when using aggregations.
We can define a main write index, and yet define all the indices created so far as secondary write indices. That means that when we want to add a new information about an actor, we must first check if he/she exists in one of our indices. If it does exist, we would update the document in that index. If it doesn't exists, we would create a new document in the currently main write index.
We can choose to ignore the problem, meaning that we always write to the currently write index. But, we make a background process which runs every night and settle things up.
A smarter approach would be somehow skip the querying part by maintaining a dedicated DB to model the relation between an actor and its related index. We could even try to create special id for each actor, which allow to easily parse it and to know just by the id which index (time frame) related to it.
Dealing with the Search Process
If we want to query all the latest actors added to our system - it's easy. We can just query the last created indices. But, what should we do when we want to query all the latest created movies (our nested documents)?
The trivial approach would be querying all the indices in the cluster - and then to filter the nested documents by the latest created documents (remember that nested documents can't be accessed directly without first accessing the main documents).
A much efficient approach would be that each main document (actor document) would maintain a list of dates of related movies. In this approach we still query all the indices in the cluster, but this would allow us to at least filter the relevant main documents in each index (which linked to desired movies documents). It's important to mention that ElasticSearch caching mechanism is quite smart, so maybe this approach is useless and solution 1 might even get better performance.
Index the movies in a parent/child model, which allows to access directly to the movies. The problem with that is that nested are better approach when considering query performance. We also make a lot use in nested/reverse_nested aggregation for making "joins" in our queries and there is no "Parents" aggregation (only children aggregation).
Use routing field. This could have been a key decision here, since this allows the query to easily run only on the relevant shards. The problem with that is that a document has only a single routing field. Let's say I want a document to be associate with "YoungActors_ActiveActors" routing field. How can I run a search only on "YoungActors" shards? Is there a regex support? What If I add a "tag" field in each document that contains the different "routing" categories and make a term search for only these documents on each search? Yes, I still search all the shards, but maybe there's an optimization way to tell ElasticSearch to deny a document if it's not in the correct "tag" - and not even try to match the rest of the query on it. I imagine something like 4 AND conditions in bool clause, that the first of them fail.
So, what do you think?
I'm thrilled to know your opinions and insights!