I'm testing a production Elasticsearch system, and I have two
administrative questions:
Is there a good way to handle single node restarts? I've dug through
past posts and can't find anything very useful. We're using the local
gateway (and local store), so it should be possible for a single-node
restart to trigger a minimal amount of shard reallocation. I am sure
there's a way to delay shard reallocation with the appropriate settings,
but I haven't figured out the right mixture yet. Any pointers would be
great!
Handling index growth: I watched Shay's talk on different data flows,
and our current setup is most like the "users flow": all documents have a
user and most queries are restricted to a single user, so we set the
routing parameter to be the user id. As our index grows, I am worried that
individual shards will get to be too large. Most of our queries are over all of a user's data, so creating a new index per-week (say) doesn't seem
like the best solution, as it eliminates the benefits of routing.
I understand the reasons for not allowing live shard splitting. Our current
plan in case we need to split shards is to start up a second cluster with
more shards, duplicate live data to that cluster, and then use a backup to
fill in the old data. Is there a better approach? Or should we really
consider time-range indices?
Is there a good way to handle single node restarts? I've dug
through past posts and can't find anything very useful. We're using
the local gateway (and local store), so it should be possible for a
single-node restart to trigger a minimal amount of shard reallocation.
I am sure there's a way to delay shard reallocation with the
appropriate settings, but I haven't figured out the right mixture yet.
Any pointers would be great!
You can use the cluster update settings API to set
cluster.routing.allocation.disable_allocation to true before restarting,
which will ensure that shards are not reallocated while the node is
down. Just remember to set it back to false once the node has rejoined
the cluster.
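For example, something like this (a sketch, assuming a node listening on localhost:9200 and the pre-1.0 setting name; later releases replaced disable_allocation with cluster.routing.allocation.enable):

```shell
# Disable shard allocation before taking the node down
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.disable_allocation": true
  }
}'

# ... restart the node and wait for it to rejoin the cluster ...

# Re-enable allocation so the node recovers its local shard copies
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.disable_allocation": false
  }
}'
```

Using "transient" rather than "persistent" means the setting is discarded on a full cluster restart, which is usually what you want for a one-off node bounce.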
Handling index growth: I watched Shay's talk on different data
flows, and our current setup is most like the "users flow": all
documents have a user and most queries are restricted to a single
user, so we set the routing parameter to be the user id. As our index
grows, I am worried that individual shards will get to be too large.
Most of our queries are over all of a user's data, so creating a new
index per week (say) doesn't seem like the best solution, as it
eliminates the benefits of routing.
If your use case fits the index-per-user model, then don't worry about
the time-based model.
The key to this flexibility is aliases.
For example, say you have a client 'foo' who starts out in your general
all-clients index. You can set up two aliases, one for writing and one
for reading (I'll explain why later):

- all documents for this client will be stored on a single shard in
  your all_clients index
- when querying, the filter {client_id == 'foo'} will be applied
  automatically, so the 'foo' client appears to live in its own index
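Setting these up might look like the following (a sketch using the aliases API; the all_clients index and client_id field are just the names from the example above):

```shell
curl -XPOST localhost:9200/_aliases -d '{
  "actions": [
    {"add": {
      "index": "all_clients",
      "alias": "foo_write",
      "routing": "foo"
    }},
    {"add": {
      "index": "all_clients",
      "alias": "foo_read",
      "routing": "foo",
      "filter": {"term": {"client_id": "foo"}}
    }}
  ]
}'
```

The "routing" value on each alias keeps all of foo's documents on a single shard, and the "filter" on the read alias is what makes the client appear to have its own index.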
So why two aliases?
You can only write to a single index (or an alias
which points to a single index), but you can query multiple indices. So
to avoid making changes (like the ones explained below) to your
application in the future, start out using separate foo_write and
foo_read aliases.
Now, say the client has grown large enough to warrant its own index.
You can create a new index 'foo_v1' and then adjust your aliases so that
foo_write points only to the new index, and foo_read points to both.
So now, all writes will go to the new foo_v1 index, but queries will go
both to foo_v1 and the old "all_clients/routing:foo/client_id==foo"
alias as well.
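The switch can be done in a single atomic aliases call (a sketch; note that foo_read keeps its old filtered alias on all_clients and simply gains foo_v1):

```shell
curl -XPOST localhost:9200/_aliases -d '{
  "actions": [
    {"remove": {"index": "all_clients", "alias": "foo_write"}},
    {"add":    {"index": "foo_v1",      "alias": "foo_write"}},
    {"add":    {"index": "foo_v1",      "alias": "foo_read"}}
  ]
}'
```

Because all actions in one _aliases request are applied atomically, there is no window where foo_write points at zero or two indices.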
The only thing that you need to be careful about is getting and updating
existing docs.
The doc-GET API can't use the 'foo_read' alias because it points to more
than one index. You have a few choices:
- try the new index first, and if that fails to find the doc, try the
  old index
- use a query instead of a GET
- move the old data into the new index
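The first option could be sketched like this (the doc type 'client' and id '123' are hypothetical; curl's -f flag makes the first request fail on a 404 so the fallback runs):

```shell
curl -sf "localhost:9200/foo_v1/client/123" \
  || curl -s "localhost:9200/all_clients/client/123?routing=foo"
```

Note that the fallback GET against all_clients needs the routing value explicitly, since a bare GET doesn't go through the alias.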
Similarly, when you update/reindex an existing doc you must either:
- store it in the same index that it came from, or
- delete it from the old index and save it to the new index
Thanks for your answer --- it's really helpful. Luckily (?), all of our
code is written in terms of queries, not GETs, as we are using
Elasticsearch as a form of secondary indexing for our (distributed)
database, so any documents for which we already have the id can bypass ES
altogether.
Carl
On Monday, October 8, 2012 2:38:13 AM UTC-7, Clinton Gormley wrote:
The only thing that you need to be careful about is getting and updating
existing docs.
The doc-GET API can't use the 'foo_read' alias because it points to more
than one index. You have a few choices:
- try the new index first, and if that fails to find the doc, try the
  old index