Single-node restarts and re-indexing


(Carl C) #1

Hi everyone,

I'm testing a production ElasticSearch system, and I have two
administrative questions:

  1. Is there a good way to handle single node restarts? I've dug through
    past posts and can't find anything very useful. We're using the local
    gateway (and local store), so it should be possible for a single-node
    restart to trigger a minimal amount of shard reallocation. I am sure
    there's a way to delay shard reallocation with the appropriate settings,
    but I haven't figured out the right mixture yet. Any pointers would be
    great!

  2. Handling index growth: I watched Shay's talk on different data flows,
    and our current setup is most like the "users flow": all documents have a
    user and most queries are restricted to a single user, so we set the
    routing parameter to be the user id. As our index grows, I am worried that
    individual shards will get to be too large. Most of our queries are over
    all of a user's data, so creating a new index per-week (say) doesn't seem
    like the best solution, as it eliminates the benefits of routing.

I understand the reasons for not allowing live shard splitting. Our current
plan in case we need to split shards is to start up a second cluster with
more shards, duplicate live data to that cluster, and then use a backup to
fill in the old data. Is there a better approach? Or should we really
consider time-range indices?

Thanks,
Carl

--


(Clinton Gormley) #2

Hi Carl

  1. Is there a good way to handle single node restarts? I've dug
    through past posts and can't find anything very useful. We're using
    the local gateway (and local store), so it should be possible for a
    single-node restart to trigger a minimal amount of shard reallocation.
    I am sure there's a way to delay shard reallocation with the
    appropriate settings, but I haven't figured out the right mixture yet.
    Any pointers would be great!

You can use the cluster update settings API
http://www.elasticsearch.org/guide/reference/api/admin-cluster-update-settings.html
to set cluster.routing.allocation.disable_allocation before restarting,
which will ensure that shards are not reallocated. Just remember to
reenable it after your node has started up again.
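
As a rough sketch of the sequence described above, the setting could be toggled with curl as shown here. The node address is an assumption, and `allocation_body` and `set_allocation_disabled` are hypothetical helper names. The sketch uses a transient setting, which is only one possible choice.

```shell
#!/bin/sh
# Assumed node address; adjust for your cluster.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Build the settings body.
# $1 is "true" (disable allocation) or "false" (re-enable it).
allocation_body() {
  printf '{"transient":{"cluster.routing.allocation.disable_allocation":%s}}' "$1"
}

# Send the setting to the cluster update settings API.
set_allocation_disabled() {
  curl -s -XPUT "$ES_URL/_cluster/settings" -d "$(allocation_body "$1")"
}

# Typical sequence against a real cluster (left commented out here):
# set_allocation_disabled true     # before stopping the node
# ...restart the node and wait for it to rejoin the cluster...
# set_allocation_disabled false    # once the node is back
```

A transient setting is lost on a full cluster restart, so a forgotten re-enable does not stay in effect forever.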

  2. Handling index growth: I watched Shay's talk on different data
    flows, and our current setup is most like the "users flow": all
    documents have a user and most queries are restricted to a single
    user, so we set the routing parameter to be the user id. As our index
    grows, I am worried that individual shards will get to be too large.
    Most of our queries are over all of a user's data, so creating a new
    index per-week (say) doesn't seem like the best solution, as it
    eliminates the benefits of routing.

If your use case fits the index-per-user model, then don't worry about
the time-based model.

The key to this flexibility is aliases.

For example, you have a user 'foo' who starts out in your general
all-users index. You can set up two aliases, one for writing and one for
reading (I'll explain why later):

curl -XPOST 'http://127.0.0.1:9200/_aliases?pretty=1' -d '
{
   "actions" : [
      {
         "add" : {
            "index" : "all_clients",
            "alias" : "foo_write",
            "routing" : "foo"
         }
      },
      {
         "add" : {
            "index" : "all_clients",
            "filter" : {
               "term" : {
                  "client_id" : "foo"
               }
            },
            "alias" : "foo_read",
            "routing" : "foo"
         }
      }
   ]
}
'

The above means that:

  • all documents for this client will be stored on a single shard in
    your all_clients index
  • when querying, the filter {client_id == 'foo'} will be automatically
    applied to the results, so the 'foo' client appears to live in its
    own index
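
To illustrate how an application might use the two aliases, here is a minimal sketch. The type name `doc`, the document fields other than client_id, and the helper names `index_url` and `search_url` are assumptions made for this example. Note that the alias supplies the routing value, but the document itself still has to carry client_id for the read filter to match it.

```shell
#!/bin/sh
ES_URL="${ES_URL:-http://127.0.0.1:9200}"   # assumed node address
DOC_TYPE="${DOC_TYPE:-doc}"                 # hypothetical type name

# URL for indexing a document: always go through the write alias.
index_url() { printf '%s/foo_write/%s/%s' "$ES_URL" "$DOC_TYPE" "$1"; }

# URL for searching: always go through the read alias.
search_url() { printf '%s/foo_read/_search' "$ES_URL"; }

# Against a real cluster (left commented out here):
# curl -XPUT "$(index_url 1)" -d '{"client_id":"foo","title":"hello"}'
# curl -XGET "$(search_url)" -d '{"query":{"match_all":{}}}'
```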

So why two aliases?

You can only write to a single index (or an alias which points to a
single index), but you can query multiple indices. So to avoid making
changes (like the ones explained below) to your application in the
future, start out using separate foo_write and foo_read aliases.

Now suppose the client has grown large enough to warrant their own index.

You can create a new index 'foo_v1' and then adjust your aliases so that
foo_write points to the new index, and foo_read points to BOTH:

curl -XPOST 'http://127.0.0.1:9200/_aliases?pretty=1' -d '
{
   "actions" : [
      {
         "remove" : {
            "index" : "all_clients",
            "alias" : "foo_write"
         }
      },
      {
         "add" : {
            "index" : "foo_v1",
            "alias" : "foo_write"
         }
      },
      {
         "add" : {
            "index" : "foo_v1",
            "alias" : "foo_read"
         }
      }
   ]
}
'

So now, all writes will go to the new foo_v1 index, but queries will go
both to foo_v1 and the old "all_clients/routing:foo/client_id==foo"
index as well.

The only thing that you need to be careful about is getting and updating
existing docs.

The doc-GET API can't use the 'foo_read' alias because it points to more
than one index. You have a few choices:

  1. try the new index first, and if that fails to find the doc, try the
    old index
  2. use a query instead of a GET
  3. move the old data into the new index
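
The first choice in the list above (new index first, then the old one) might look roughly like this. The type name `doc` and the helpers `fetch` and `get_doc` are hypothetical. The sketch relies on curl's -f flag, which makes curl exit non-zero when the server answers with an HTTP error such as 404.

```shell
#!/bin/sh
ES_URL="${ES_URL:-http://127.0.0.1:9200}"   # assumed node address
DOC_TYPE="${DOC_TYPE:-doc}"                 # hypothetical type name

# Fetch one URL; -f turns an HTTP error (e.g. 404) into a non-zero exit.
fetch() { curl -sf "$1"; }

# $1 is the document id. Try the client's own index first, then fall
# back to the shared index, where the routing value must be supplied.
get_doc() {
  fetch "$ES_URL/foo_v1/$DOC_TYPE/$1" ||
    fetch "$ES_URL/all_clients/$DOC_TYPE/$1?routing=foo"
}

# Against a real cluster (left commented out here):
# get_doc 42
```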

Similarly, when you update/reindex an existing doc you must either:

  1. store it in the same index that it came from, or
  2. delete it from the old index and save it to the new index
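
The second option above could be sketched as follows. The type name `doc` and the helpers `es` and `move_doc` are assumptions for illustration. The sketch saves the new copy first and deletes the old one only if the save succeeded, so a failure part-way through does not lose the document.

```shell
#!/bin/sh
ES_URL="${ES_URL:-http://127.0.0.1:9200}"   # assumed node address
DOC_TYPE="${DOC_TYPE:-doc}"                 # hypothetical type name

# Thin wrapper around curl; -f makes HTTP errors return non-zero.
es() { curl -sf "$@"; }

# $1 is the document id, $2 is the updated JSON source.
move_doc() {
  # Save through the write alias, which now points at foo_v1 ...
  es -XPUT "$ES_URL/foo_write/$DOC_TYPE/$1" -d "$2" &&
    # ... then remove the stale copy from the shared index.
    es -XDELETE "$ES_URL/all_clients/$DOC_TYPE/$1?routing=foo"
}

# Against a real cluster (left commented out here):
# move_doc 42 '{"client_id":"foo","title":"updated"}'
```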

clint

--


(Clinton Gormley) #3

The only thing that you need to be careful about is getting and updating
existing docs.

The doc-GET API can't use the 'foo_read' alias because it points to more
than one index. You have a few choices:

  1. try the new index first, and if that fails to find the doc, try the
    old index
  2. use a query instead of a GET
  3. move the old data into the new index

I've opened an issue with some ideas about how to resolve the above more
transparently:
https://github.com/elasticsearch/elasticsearch/issues/2309

clint

--


(Carl C) #4

Hi Clint,

Thanks for your answer --- it's really helpful. Luckily (?), all of our
code is written in terms of queries, not GETs, as we are using
ElasticSearch as a form of secondary indexing for our (distributed)
database, so any documents for which we already have the id can bypass ES
altogether.

Carl

On Monday, October 8, 2012 2:38:13 AM UTC-7, Clinton Gormley wrote:

The only thing that you need to be careful about is getting and updating
existing docs.

The doc-GET API can't use the 'foo_read' alias because it points to more
than one index. You have a few choices:

  1. try the new index first, and if that fails to find the doc, try the
    old index
  2. use a query instead of a GET
  3. move the old data into the new index

I've opened an issue with some ideas about how to resolve the above more
transparently:
https://github.com/elasticsearch/elasticsearch/issues/2309

clint

--

