Node Local Storage and Gateway Storage

Hello everyone,

I am not sure if this question has been asked before, but I wanted to get
a fresh answer in case anything has changed recently.

I am exploring options for a search application with a large index (a few
TB). We would have a few indexes (e.g. books, videos,
articles, other web resources). I looked at node storage
(http://www.elasticsearch.org/guide/reference/index-modules/store.html) and
gateway storage (http://www.elasticsearch.org/guide/reference/modules/gateway/)
but would like to clarify a couple of scenarios/approaches.
Could someone please help me with answers to the following questions?

  1. Is it possible to store index data locally and still have gateway
    persistence to S3 for long-term persistence?
  2. What's the benefit of using multiple indexes with respect to index
    size? Purely from a search performance perspective (for large index
    data), which option is better: a) use a single flat index, or b)
    use multiple indices and search across several of them only in those
    rare cases where you really need to?
  3. How does sharding work when adding and removing nodes? Consider the
    following scenario:
    a. We start a cluster with 1 node and 1 index with 5 shards and 1
    replica.
    b. We index documents (let's assume that all shards get some
    data), but we still have a single node, so it will hold all shards;
    the cluster status will be yellow at this point.
    c. Now we add a new node to the cluster. Would the replicas be
    transferred over to the new node, or some of the 5 shards, or both?
    Will this bring the cluster status to green?
    d. Now let's assume that we add 3 more nodes, so at this point we
    have 5 nodes in total. Will the cluster status be green with 5 nodes,
    or do we absolutely need MAX (shard) * MAX (replication) nodes for the
    cluster status to be green?
    e. Now we add 5 more nodes, so we have 10 in total. At this
    point, will every node have a single shard (5 primary + 5 secondary/
    replica)? How is this calculated?
    f. In what scenario would we lose shards (let's assume we are
    using the local gateway)? If I continue bringing down each node and go
    back to a single node, can I get back to cluster state yellow (point b)
    without losing any shards/data?

Appreciate your help.

Thanks,
Mihir

On Wed, Jan 18, 2012 at 8:47 AM, Mihir Patel exploremihir@gmail.com wrote:

Hello everyone,

I am not sure if this question has been asked before, but I wanted to get
a fresh answer in case anything has changed recently.

I am exploring options for a search application with a large index (a few
TB). We would have a few indexes (e.g. books, videos,
articles, other web resources). I looked at node storage
(http://www.elasticsearch.org/guide/reference/index-modules/store.html) and
gateway storage (http://www.elasticsearch.org/guide/reference/modules/gateway/)
but would like to clarify a couple of scenarios/approaches.
Could someone please help me with answers to the following questions?

  1. Is it possible to store index data locally and still have gateway
    persistence to S3 for long-term persistence?

It's not an option; that's how it works. Nodes still hold local storage of
what is stored on S3, spread across the nodes.

  2. What's the benefit of using multiple indexes with respect to index
    size? Purely from a search performance perspective (for large index
    data), which option is better: a) use a single flat index, or b)
    use multiple indices and search across several of them only in those
    rare cases where you really need to?

Searching on 1 index with 100 shards is the same as searching across 100
indices with 1 shard each.
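
To make that equivalence concrete, here is a minimal Python sketch. It relies on the fact that Elasticsearch accepts a comma-separated list of index names in the search path; the helper names `search_path` and `shards_searched` are made up for illustration, and the index names are the ones from the question.

```python
from urllib.parse import quote


def search_path(indices):
    """Build the URL path for a search that spans one or more indices."""
    if not indices:
        raise ValueError("at least one index name is required")
    # Index names are joined with commas in the path, e.g. /books,videos/_search
    return "/" + ",".join(quote(name, safe="") for name in indices) + "/_search"


def shards_searched(shards_per_index):
    """Total number of shards a search touches, given each index's shard count."""
    return sum(shards_per_index)


# One index with 100 shards versus 100 indices with 1 shard each:
one_big = shards_searched([100])
many_small = shards_searched([1] * 100)

print(search_path(["books", "videos", "articles"]))
print(one_big == many_small)
```

Either way the search fans out to the same number of shards, so splitting by content type mostly buys the option to search only a subset.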

  3. How does sharding work when adding and removing nodes? Consider the
    following scenario:
    a. We start a cluster with 1 node and 1 index with 5 shards and 1
    replica.
    b. We index documents (let's assume that all shards get some
    data), but we still have a single node, so it will hold all shards;
    the cluster status will be yellow at this point.
    c. Now we add a new node to the cluster. Would the replicas be
    transferred over to the new node, or some of the 5 shards, or both?
    Will this bring the cluster status to green?

Yes, 5 replicas will be allocated to the other node.
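
The yellow-to-green transition can be sketched in a few lines of Python. This is a simplified model, not the actual allocator: it assumes all primaries are assigned and only encodes the rule that a replica is never placed on the same node as another copy of the same shard. The function name `cluster_status` is hypothetical.

```python
def cluster_status(num_nodes, replicas_per_shard):
    """Rough health colour for a cluster whose primaries are all assigned.

    Assumption: the only constraint modelled is that two copies of the
    same shard never share a node.
    """
    if num_nodes < 1:
        return "red"  # no node left to hold even the primaries
    # Each shard needs one node for the primary plus one per replica.
    assignable_replicas = min(replicas_per_shard, num_nodes - 1)
    return "green" if assignable_replicas == replicas_per_shard else "yellow"


print(cluster_status(1, 1))  # single node: replicas cannot be assigned
print(cluster_status(2, 1))  # second node joins: replicas have a home
```

With 1 replica, two nodes are already enough for green; the number of shards does not change that.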

d. Now let's assume that we add 3 more nodes, so at this point we
have 5 nodes in total. Will the cluster status be green with 5 nodes, or do
we absolutely need MAX (shard) * MAX (replication) nodes for the cluster
status to be green?

The cluster will be green. Moreover, the 10 shards you have (5 shards + 1
replica each) will now be spread across the 5 nodes. Use the cluster state
API to see what is allocated where, or install the elasticsearch head
plugin.
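
As a rough sketch of checking this from a script: the snippet below reads the cluster health endpoint (`/_cluster/health`), a lighter companion to the cluster state API (`/_cluster/state`) mentioned above. The base URL `http://localhost:9200` is an assumption about your setup, the helper names are made up, and the sample payload is illustrative rather than real output.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url="http://localhost:9200"):
    """Fetch cluster health as a dict (base_url is an assumed local node)."""
    with urlopen(base_url + "/_cluster/health") as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_health(health):
    """Turn a cluster health payload into a one-line summary."""
    return "{}: {} nodes, {} active shards, {} unassigned".format(
        health["status"],
        health["number_of_nodes"],
        health["active_shards"],
        health["unassigned_shards"],
    )


# Illustrative payload for the 5-node scenario (not captured from a real cluster).
sample = {
    "status": "green",
    "number_of_nodes": 5,
    "active_primary_shards": 5,
    "active_shards": 10,
    "unassigned_shards": 0,
}
summary = summarize_health(sample)
print(summary)
```

Against a live cluster you would call `summarize_health(fetch_health())` instead of using the sample dict.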

e. Now we add 5 more nodes, so we have 10 in total. At this
point, will every node have a single shard (5 primary + 5 secondary/
replica)? How is this calculated?

Yes. The calculation aims to have an even number of shards per node.
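
The "even number of shards per node" goal is just integer division. A minimal sketch, assuming nothing else constrains placement (the real allocator also keeps copies of the same shard on different nodes); `shards_per_node` is a hypothetical helper name:

```python
def shards_per_node(total_shards, num_nodes):
    """Spread shard copies as evenly as possible; returns a count per node."""
    if num_nodes < 1:
        raise ValueError("need at least one node")
    base, extra = divmod(total_shards, num_nodes)
    # The first `extra` nodes take one additional shard each.
    return [base + 1 if i < extra else base for i in range(num_nodes)]


total = 5 * (1 + 1)  # 5 primaries, each with 1 replica
print(shards_per_node(total, 5))
print(shards_per_node(total, 10))
```

For the scenario in the question this gives 2 shard copies per node on 5 nodes and 1 per node on 10 nodes.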

f. In what scenario would we lose shards (let's assume we are
using the local gateway)? If I continue bringing down each node and go
back to a single node, can I get back to cluster state yellow (point b)
without losing any shards/data?

If you bring down one node, the shards allocated on it will start to be
allocated on the rest of the cluster. Once the cluster is green again, you
can bring down another node. Note that even if you lose two nodes that
between them held both a shard and its replica, the relevant shard will be
reallocated as long as you can bring those nodes back (at least with the
same data).
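
To illustrate why taking nodes down one at a time is safe, here is a small Python sketch. It treats a shard as unavailable only when every node holding a copy of it is down at the same moment. The allocation table, the node names, and the helper `lost_shards` are all hypothetical.

```python
def lost_shards(allocation, down_nodes):
    """Return the shard ids that have no copy left on a running node.

    allocation maps shard id -> set of node names holding a copy
    (primary or replica).
    """
    down = set(down_nodes)
    return sorted(
        shard for shard, nodes in allocation.items() if set(nodes) <= down
    )


# Hypothetical layout: 3 shards, 1 replica each, on 3 nodes.
allocation = {
    0: {"node1", "node2"},
    1: {"node2", "node3"},
    2: {"node3", "node1"},
}

print(lost_shards(allocation, ["node1"]))           # every shard still has a copy
print(lost_shards(allocation, ["node1", "node2"]))  # shard 0 has no live copy
```

Waiting for green between shutdowns gives the cluster time to rebuild the missing copies elsewhere, so the second case does not arise; and with the local gateway, a shard in that state comes back once one of its nodes returns with its data.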
