Hi Carl
- Is there a good way to handle single node restarts? I've dug
through past posts and can't find anything very useful. We're using
the local gateway (and local store), so it should be possible for a
single-node restart to trigger a minimal amount of shard reallocation.
I am sure there's a way to delay shard reallocation with the
appropriate settings, but I haven't figured out the right mixture yet.
Any pointers would be great!
You can use the cluster update settings API
to set cluster.routing.allocation.disable_allocation before restarting,
which will ensure that shards are not reallocated. Just remember to
reenable it after your node has started up again.
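As a rough sketch of what that could look like (the ES_URL variable and the function names are my own, and I'm assuming a transient setting is what you want so that it doesn't outlive a full cluster restart):

```shell
# Sketch only: ES_URL and both function names are illustrative assumptions.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Build the settings body; $1 is "true" (disable) or "false" (re-enable).
allocation_payload() {
  printf '{"transient":{"cluster.routing.allocation.disable_allocation":%s}}' "$1"
}

# Send the body through the cluster update settings API.
set_allocation_disabled() {
  curl -s -XPUT "$ES_URL/_cluster/settings" -d "$(allocation_payload "$1")"
}

# Intended use around a restart:
#   set_allocation_disabled true
#   ...restart the node and wait for it to rejoin the cluster...
#   set_allocation_disabled false
```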
- Handling index growth: I watched Shay's talk on different data
flows, and our current setup is most like the "users flow": all
documents have a user and most queries are restricted to a single
user, so we set the routing parameter to be the user id. As our index
grows, I am worried that individual shards will get to be too large.
Most of our queries are over all of a user's data, so creating a new
index per-week (say) doesn't seem like the best solution, as it
eliminates the benefits of routing.
If your use case fits the index-per-user model, then don't worry about
the time-based model.
The key to this flexibility is aliases.
For example, you have a user 'foo' who starts out in your general
all-users index. You can set up two aliases, one for writing and one for
reading (I'll explain why later):
curl -XPOST 'http://127.0.0.1:9200/_aliases?pretty=1' -d '
{
   "actions" : [
      {
         "add" : {
            "index" : "all_clients",
            "alias" : "foo_write",
            "routing" : "foo"
         }
      },
      {
         "add" : {
            "index" : "all_clients",
            "filter" : {
               "term" : {
                  "client_id" : "foo"
               }
            },
            "alias" : "foo_read",
            "routing" : "foo"
         }
      }
   ]
}
'
The above means that:
- all documents for this client will be stored on a single shard in
your all_clients index
- when querying, the filter {client_id == 'foo'} will be automatically
applied to the results, so the 'foo' client appears to live in its
own index
So why two aliases?
You can only write to a single index (or an alias
which points to a single index), but you can query multiple indices. So
to avoid making changes (like the ones explained below) to your
application in the future, start out using separate foo_write and
foo_read aliases.
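From the application side, that might look roughly like the sketch below. The function names, the HTTP variable and the type/id arguments are placeholders of mine, not anything your setup has to use. Note that the write alias only adds routing, so the document itself still needs a client_id field for the read filter to match it.

```shell
# Sketch only: ES_URL, HTTP and the function names are illustrative assumptions.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"
HTTP="${HTTP:-curl -s}"

# Index a document through the write alias; the alias supplies routing=foo.
# $1 = type (hypothetical), $2 = id, $3 = JSON body including "client_id":"foo"
index_foo_doc() {
  $HTTP -XPUT "$ES_URL/foo_write/$1/$2" -d "$3"
}

# Search through the read alias; the alias applies the client_id filter
# and the routing, so the query body needs neither.
# $1 = JSON search body
search_foo() {
  $HTTP -XGET "$ES_URL/foo_read/_search" -d "$1"
}

# Intended use:
#   index_foo_doc doc 1 '{"client_id":"foo","title":"hello"}'
#   search_foo '{"query":{"match_all":{}}}'
```

Because the application only ever sees foo_write and foo_read, these calls stay the same when the aliases are later pointed somewhere else.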
Now, the client has grown large enough to warrant their own index.
You can create a new index 'foo_v1' and then adjust your aliases so that
foo_write points to the new index, and foo_read points to BOTH:
curl -XPOST 'http://127.0.0.1:9200/_aliases?pretty=1' -d '
{
   "actions" : [
      {
         "remove" : {
            "index" : "all_clients",
            "alias" : "foo_write"
         }
      },
      {
         "add" : {
            "index" : "foo_v1",
            "alias" : "foo_write"
         }
      },
      {
         "add" : {
            "index" : "foo_v1",
            "alias" : "foo_read"
         }
      }
   ]
}
'
So now, all writes will go to the new foo_v1 index, but queries will go
both to foo_v1 and to the old "all_clients/routing:foo/client_id==foo"
index as well.
The only thing that you need to be careful about is getting and updating
existing docs.
The doc-GET API can't use the 'foo_read' alias because it points to more
than one index. You have a few choices:
- try the new index first, and if that fails to find the doc, try the
old index
- use a query instead of a GET
- move the old data into the new index
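The first option could be sketched like this. The get_doc name, the FETCH variable and the type/id arguments are hypothetical. I'm relying on curl's -f flag to return a non-zero exit code on a 404, and on the old documents having been indexed with routing 'foo', which is why the fallback passes routing explicitly.

```shell
# Sketch only: ES_URL, FETCH and get_doc are illustrative assumptions.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"
FETCH="${FETCH:-curl -sf}"

# Try the new index first; if the doc isn't there, fall back to the old
# shared index, passing the routing value the doc was indexed with.
# $1 = type (hypothetical), $2 = id
get_doc() {
  $FETCH "$ES_URL/foo_v1/$1/$2" ||
    $FETCH "$ES_URL/all_clients/$1/$2?routing=foo"
}

# Intended use:
#   get_doc doc 42
```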
Similarly, when you update/reindex an existing doc you must either:
- store it in the same index that it came from, or
- delete it from the old index and save it to the new index
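A minimal sketch of the second option follows. The move_doc name, the HTTP variable and the arguments are placeholders of mine. It saves to the new index first and only deletes the old copy if that succeeds, so a failure part-way leaves a duplicate rather than a lost doc.

```shell
# Sketch only: ES_URL, HTTP and move_doc are illustrative assumptions.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"
HTTP="${HTTP:-curl -sf}"

# Save the updated doc via the write alias (now pointing at foo_v1), then
# delete the stale copy from the old shared index using its routing value.
# $1 = type (hypothetical), $2 = id, $3 = updated JSON body
move_doc() {
  $HTTP -XPUT "$ES_URL/foo_write/$1/$2" -d "$3" &&
    $HTTP -XDELETE "$ES_URL/all_clients/$1/$2?routing=foo"
}

# Intended use:
#   move_doc doc 42 '{"client_id":"foo","title":"updated"}'
```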
clint