Single-node restarts and re-indexing

Carl_C · October 5, 2012, 2:45am

Hi everyone,

I'm testing a production ElasticSearch system, and I have two
administrative questions:

Is there a good way to handle single node restarts? I've dug through
past posts and can't find anything very useful. We're using the local
gateway (and local store), so it should be possible for a single-node
restart to trigger a minimal amount of shard reallocation. I am sure
there's a way to delay shard reallocation with the appropriate settings,
but I haven't figured out the right mixture yet. Any pointers would be
great!
Handling index growth: I watched Shay's talk on different data flows,
and our current setup is most like the "users flow": all documents have a
user and most queries are restricted to a single user, so we set the
routing parameter to be the user id. As our index grows, I am worried that
individual shards will get to be too large. Most of our queries are over
all of a user's data, so creating a new index per-week (say) doesn't seem
like the best solution, as it eliminates the benefits of routing.

I understand the reasons for not allowing live shard splitting. Our current
plan in case we need to split shards is to start up a second cluster with
more shards, duplicate live data to that cluster, and then use a backup to
fill in the old data. Is there a better approach? Or should we really
consider time-range indices?

Thanks,
Carl

--

Clinton_Gormley · October 8, 2012, 9:17am

Hi Carl

Is there a good way to handle single node restarts? I've dug
through past posts and can't find anything very useful. We're using
the local gateway (and local store), so it should be possible for a
single-node restart to trigger a minimal amount of shard reallocation.
I am sure there's a way to delay shard reallocation with the
appropriate settings, but I haven't figured out the right mixture yet.
Any pointers would be great!

You can use the cluster update settings API

to set cluster.routing.allocation.disable_allocation before restarting,
which will ensure that shards are not reallocated. Just remember to
reenable it after your node has started up again.

Handling index growth: I watched Shay's talk on different data
flows, and our current setup is most like the "users flow": all
documents have a user and most queries are restricted to a single
user, so we set the routing parameter to be the user id. As our index
grows, I am worried that individual shards will get to be too large.
Most of our queries are over all of a user's data, so creating a new
index per-week (say) doesn't seem like the best solution, as it
eliminates the benefits of routing.

If your use case fits the index-per-user model, then don't worry about
the time-based model.

The key to this flexibility is aliases.

For example, you have a user 'foo' who starts out in your general
all-users index. You can set up two aliases, one for writing and one for
reading (I'll explain why later):

curl -XPOST 'http://127.0.0.1:9200/_aliases?pretty=1' -d '
{
"actions" : [
{
"add" : {
"index" : "all_clients",
"alias" : "foo_write",
"routing" : "foo"
}
},
{
"add" : {
"index" : "all_clients",
"filter" : {
"term" : {
"client_id" : "foo"
}
},
"alias" : "foo_read",
"routing" : "foo"
}
}
]
}
'

The above means that:

all documents for this client will be stored on a single shard in
your all_clients index
when querying, the filter {client_id == 'foo'} will be automatically
applied to the results, so the 'foo' client appears to live in its
own index

So why two aliases?

You can only write to a single index (or an alias
which points to a single index), but you can query multiple indices. So
to avoid making changes (like the ones explained below) to your
application in the future, start out using separate foo_write and
foo_read aliases.

Now, the client has grown large enough to warrant their own index.

You can create a new index 'foo_v1' and then adjust your aliases so that
foo_write points to the new index, and foo_read points to BOTH:

curl -XPOST 'http://127.0.0.1:9200/_aliases?pretty=1' -d '
{
"actions" : [
{
"remove" : {
"index" : "all_clients",
"alias" : "foo_write"
}
},
{
"add" : {
"index" : "foo_v1",
"alias" : "foo_write"
}
},
{
"add" : {
"index" : "foo_v1",
"alias" : "foo_read"
}
}
]
}
'

So now, all writes will go the the new foo_v1 index, but queries will go
both to foo_v1 and the old "all_clients/routing:foo/client_id==foo"
index as well.

The only thing that you need to be careful about is getting and updating
existing docs.

The doc-GET API can't use the 'doc_read' alias because it points to more
than one index. You have a few choices:

try the new index first, and if that fails to find the doc, try the
old index
use a query instead of a GET
move the old data into the new index

Similarly, when you update/reindex an existing doc you must either:

store it in the same index that it came from, or
delete it from the old index and save it to the new index

clint

--

Clinton_Gormley · October 8, 2012, 9:38am

The only thing that you need to be careful about is getting and updating
existing docs.

The doc-GET API can't use the 'doc_read' alias because it points to more
than one index. You have a few choices:

try the new index first, and if that fails to find the doc, try the
old index

use a query instead of a GET

move the old data into the new index

I've opened an issue with some ideas about how to resolve the above more
transparently:

github.com/elastic/elasticsearch

Multi-index document GET

opened 09:37AM - 08 Oct 12 UTC

closed 09:34AM - 20 May 14 UTC

clintongormley

As described in this post https://groups.google.com/d/msg/elasticsearch/lhoG0gE9…Kvo/ez3XXIlsAx0J there is one bit missing from the index-per-user model which gets in the way of making things transparent to the application. That 'bit' is that, given a document type and ID, you have to know which index the document lives in in order to be able to GET it. So when you have documents sitting in your general all-users index and you add a user-specific index, you can't easily know which index to talk to in order to retrieve a document. What about some mechanism for doing a GET from a list of indices? In other words, try index 1, if the document doesn't exist, try index 2, etc Perhaps it is a `read_priority` setting in an alias? Something like: ``` curl -XPOST 'http://127.0.0.1:9200/_aliases?pretty=1' -d ' { "actions" : [ { "add" : { "read_priority" : 1, "index" : "foo_v1", "default" : 1, "alias" : "foo" } }, { "add" : { "read_priority" : 2, "index" : "all_users", "filter" : { "term" : { "user_id" : "foo" } }, "routing" : "foo", "alias" : "foo" } } ] } ' ``` I've also added a `default` param which indicates that, when creating new documents, the index `foo_v1` should be used (as opposed to just failing with "can't write to multiple indices")

clint

--

Carl_C · October 8, 2012, 5:46pm

Hi Clint,

Thanks for your answer --- it's really helpful. Luckily (?), all of our
code is written in terms of queries, not GETs, as we are using
Elasticsearch as a form of secondary indexing for our (distributed)
database, so any documents for which we already have the id can bypass ES
altogether.

Carl

On Monday, October 8, 2012 2:38:13 AM UTC-7, Clinton Gormley wrote:

The only thing that you need to be careful about is getting and updating
existing docs.

The doc-GET API can't use the 'doc_read' alias because it points to more
than one index. You have a few choices:

try the new index first, and if that fails to find the doc, try the
old index

use a query instead of a GET

move the old data into the new index

I've opened an issue with some ideas about how to resolve the above more
transparently:
Multi-index document GET · Issue #2309 · elastic/elasticsearch · GitHub

clint

--

Topic		Replies	Views
Elasticsearch rebalancing on single node Elasticsearch	6	1338	July 5, 2017
Rolling restart elasticsearch cluster Elasticsearch	5	1830	July 5, 2017
Rolling restart Elasticsearch	5	404	July 6, 2017
Do I need to follow Full-cluster restart steps for 1 Node Elasticsearch	5	288	September 28, 2021
Restarting node takes time Elasticsearch	4	1080	July 5, 2017

Single-node restarts and re-indexing

Related topics