Understanding nodes

I'm having a little trouble understanding how Elasticsearch handles the
data stored in nodes and how it behaves when one of the nodes fails or you
add more nodes.

My questions are:

  1. Let's say you have one node (default config) with some data and you
    add another one. Does ES replicate all the data to the new node or does it
    give it a partial set of data?
  2. I understand that you can configure ES so that the data is distributed
    across several nodes and it will intelligently query all available nodes
    and give you a complete result set. What happens if a node goes down? Do
    you lose data? Which node is responsible for persisting data?

I know that the answers to these questions may start with "depends on your
config". If that is the case, I'm interested in only these two scenarios:
Scenario 1) The ES instances are in Amazon EC2 and using either the EC2 EBS
gateway or the S3 gateway.
Scenario 2) The ES instances are in Amazon EC2 and just using the instance
store (non-EBS store).

Thanks,
Julian.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

  1. It depends. Since you said "default" config, I'll assume default index settings as well, which is 5 primary shards and 1 replica per shard.
    So when you start a new node, the replicas will be allocated on it. That means all data will be replicated.

  2. All nodes behave the same unless you change the default settings. There is no specific node dedicated to persisting data; all data is stored locally on the nodes that hold the shards.
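As a concrete illustration of those defaults, this is roughly the settings body you would send when creating an index (`number_of_shards` and `number_of_replicas` are the real setting names; the index name and localhost URL below are just examples):

```python
import json

# Index settings matching the defaults discussed above:
# 5 primary shards, 1 replica per primary.
settings = {
    "settings": {
        "number_of_shards": 5,      # fixed once the index is created
        "number_of_replicas": 1,    # can be changed on a live index
    }
}

body = json.dumps(settings)
print(body)
# You would send this as the body of: PUT http://localhost:9200/your_index
```

With these defaults, a two-node cluster ends up holding 10 shards in total: 5 primaries on one node and their 5 replicas on the other, which is why all the data gets replicated when the second node joins.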

Don't use S3 gateways. They are deprecated: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway-s3.html
Using EBS is fine as long as you ask for provisioned IOPS.
Local disks are perfectly fine as well. SSD drives are better, as you can guess!

My 2 cents

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr



It all depends on your shard setup. There are two types of shards in
Elasticsearch:

  • Primary shards: Your data is broken up into primary shards and spread
    across your cluster. If you have 5 primary shards, your data is split into
    five pieces.
  • Replica shards: These are just copies of the primary shards.
    They provide high availability in case of failure.

When you add a node, what happens depends on your primary/replica setup.
We can walk through a few examples:

1 Primary Shard, 0 Replicas

  • With one node, one primary shard is allocated on the machine
  • When you add a new node, nothing happens: you only have one primary
    shard, so no data can move around. The new node goes unused

2 Primary Shards, 0 Replicas

  • With one node, both primary shards reside on the single machine.
  • When you add a node, one primary shard will move to the new machine.
    Elasticsearch tries to maintain a balance of data across nodes to spread
    load
  • You have zero replicas, so if a node fails, you will lose data.
    Primary shards simply split your data up so it can be moved around

1 Primary Shard, 1 Replica

  • With one node, a single primary shard is allocated. The replica
    remains unallocated because placing a replica on the same node as its
    primary would add no redundancy
  • When you add a node, a replica of the primary is created on the
    new machine. You now have one primary and one replica
  • If a node fails, you do not lose data. The replica can be "promoted"
    to primary status if something goes wrong

2 Primary Shards, 1 Replica

  • With one node, two primary shards are allocated. No replicas are
    allocated
  • When you add a node, one primary shard will move to the new machine.
    Replicas will also be created for both primaries (but on opposite nodes).
    You will now have two primary shards, and a replica for each
  • If a node fails, you do not lose data. Replicas are promoted
    as necessary
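The four cases above can be sketched with a toy allocation model. This is only an illustration of the placement rule (a replica never lives on the same node as its primary), not Elasticsearch's actual balancing algorithm, and all names in it are made up:

```python
def allocate(primaries, replicas, nodes):
    """Toy shard placement: returns (placement, unassigned).

    placement maps node -> list of shard labels ("P0" = primary 0,
    "R0" = a replica of shard 0). A copy of a shard is never placed
    on a node that already holds another copy of the same shard.
    """
    placement = {n: [] for n in range(nodes)}
    homes = {}  # shard id -> set of nodes holding a copy

    # Spread primaries round-robin across the nodes.
    for p in range(primaries):
        node = p % nodes
        placement[node].append(f"P{p}")
        homes[p] = {node}

    # Place each replica on the least-loaded node with no copy of that shard.
    unassigned = []
    for p in range(primaries):
        for _ in range(replicas):
            candidates = [n for n in range(nodes) if n not in homes[p]]
            if candidates:
                node = min(candidates, key=lambda n: len(placement[n]))
                placement[node].append(f"R{p}")
                homes[p].add(node)
            else:
                unassigned.append(f"R{p}")  # nowhere safe to put it
    return placement, unassigned

print(allocate(1, 0, 2))  # the second node holds nothing
print(allocate(1, 1, 1))  # the replica stays unassigned on one node
print(allocate(2, 1, 2))  # each node: one primary plus the other's replica
```

Running the four scenarios through it reproduces the behavior described above: with 1 primary and 0 replicas the new node sits empty, with 1 primary and 1 replica on a single node the replica is unassigned until a second node joins, and with 2 primaries and 1 replica each node ends up with one primary and the other shard's replica.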

Does that make sense? Basically, primaries control how finely you divide
your data (so you can scale out to more nodes), while replicas are how many
extra copies you keep around to prevent data loss.
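In numbers, for the last example above (just the arithmetic, assuming copies of a shard are always on distinct nodes):

```python
primaries, replicas = 2, 1

# Each primary has `replicas` extra full copies, so:
total_shards = primaries * (1 + replicas)
print(total_shards)  # 4

# Every copy of a shard lives on a different node, so losing any
# `replicas` nodes still leaves at least one copy of each shard.
survivable_node_losses = replicas
print(survivable_node_losses)  # 1
```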

Let me know if you have more questions!
-Zach

On Monday, November 4, 2013 3:00:41 PM UTC-5, Julian Vidal wrote:

[quoted text trimmed]
