Elasticsearch architecture at petabyte scale

Hello:

We have a very large document archive (multi-petabyte) and are considering
leveraging Elasticsearch to provide more elastic indexing/search
capabilities. We archive large amounts of data every day, so I am
considering proposing an architecture where we create a new index for every
day, then use aliases to combine searches across whichever indexes we
like.
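
The daily-index-plus-alias layout described above could be sketched roughly as follows. This is only an illustration: the `archive-YYYY.MM.DD` naming scheme, the alias name, and both helper functions are assumptions, not anything stated in this thread. The dictionary has the shape of a request body for the Elasticsearch `_aliases` endpoint; nothing is sent over the network here.

```python
from datetime import date, timedelta


def daily_index_name(day, prefix="archive"):
    """Return the (hypothetical) index name for one day, e.g. archive-2013.10.25."""
    return "{}-{}".format(prefix, day.strftime("%Y.%m.%d"))


def alias_actions(alias, start, end, prefix="archive"):
    """Build an _aliases request body that points `alias` at every daily
    index from `start` to `end`, inclusive."""
    actions = []
    day = start
    while day <= end:
        actions.append(
            {"add": {"index": daily_index_name(day, prefix), "alias": alias}}
        )
        day += timedelta(days=1)
    return {"actions": actions}


# Group the seven days ending 2013-10-25 under one searchable alias.
body = alias_actions("archive-last-week", date(2013, 10, 19), date(2013, 10, 25))
```

Posting `body` to `/_aliases` would let a single search against `archive-last-week` fan out to those seven daily indices.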

I have two questions:

  1. I have read some comments that recommend not using a single ES cluster
    for petabyte levels of data; that it is better to create separate clusters
    at this scale (e.g. a separate cluster for each month). If that is the
    case, are there capabilities for doing cross cluster search/aggregation of
    results, or would that be implemented by the application?

  2. I have read mixed information about the split brain issue. Because our
    archive is so large, we cannot afford to reindex large portions of it, so
    the split brain issue is a significant concern. On the one hand, I have
    read that with proper configuration, split brain is not a problem. On the
    other, I have read that even with proper configuration it is still
    possible to have a split brain. So let me pose the question this way:
    Suppose you would be fired if you ever had to reindex more than 5 nodes
    in your cluster at once... would you still use Elasticsearch given the
    split brain issue (assume perfect configuration, i.e. that the split was
    not caused by a configuration error, but that network disruptions between
    the nodes are possible)? I am just trying to gauge how serious a problem
    this is for ES.

Thanks!
Milad

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

  1. Can you give pointers to the recommendation to use more than one cluster
    for petabyte scale? I find it hard to believe because, you are right,
    there is no support for cross-cluster indexing/search (you have to build
    your own solution at the application level with many clients).

  2. From my understanding, zen discovery (the default discovery module
    in ES) requires a sound network, and on that basis it can detect faulty
    nodes by sending multicast or unicast pings. If the network is partially
    unavailable, it is challenging to detect missing communication between
    nodes. But there are alternatives, like a ZooKeeper-based discovery
    plugin (https://github.com/sonian/elasticsearch-zookeeper). ZooKeeper is
    known to be a robust implementation of consensus.

Jörg
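
As a rough illustration of the "proper configuration" side of the split brain question: with zen discovery, the usual advice is to set `discovery.zen.minimum_master_nodes` to a majority of the master-eligible nodes, so that a minority partition cannot elect its own master. The helper names below are hypothetical, and the sketch only does the arithmetic; it does not talk to a cluster.

```python
def minimum_master_nodes(master_eligible):
    """Majority quorum for master election: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(partition_size, master_eligible):
    """True if a network partition containing `partition_size`
    master-eligible nodes still holds a majority."""
    return partition_size >= minimum_master_nodes(master_eligible)


# With 5 master-eligible nodes split 3/2, only the 3-node side has a quorum.
quorum = minimum_master_nodes(5)
majority_side = can_elect_master(3, 5)
minority_side = can_elect_master(2, 5)
```

A quorum covers the clean-partition case. The partially available network described above, where nodes disagree about who can reach whom, is the harder case, and it is the one external coordination such as ZooKeeper is meant to address.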


On Friday, October 25, 2013 2:51:35 PM UTC+2, Milad Fatenejad wrote:

[...]

Split brain would only affect the indices you are writing to, so if you
create a new index every day, you'd never lose more than a day's worth of
indexing (assuming you have some backup strategy for the rest). All your
other indices are read only.
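
One way to make the "older indices are read only" assumption explicit is to block writes on every daily index except today's. The sketch below only decides which indices to block and builds the settings body. The `archive-YYYY.MM.DD` naming and the function name are assumptions; `index.blocks.write` is the index-level setting that rejects writes.

```python
from datetime import date


def indices_to_write_block(index_names, today, prefix="archive"):
    """Return every index except the one receiving today's documents."""
    current = "{}-{}".format(prefix, today.strftime("%Y.%m.%d"))
    return sorted(name for name in index_names if name != current)


# Settings body that could be applied to each index returned above.
WRITE_BLOCK = {"index": {"blocks": {"write": True}}}

names = ["archive-2013.10.24", "archive-2013.10.25", "archive-2013.10.26"]
blocked = indices_to_write_block(names, date(2013, 10, 26))
```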

Regarding the data size, I think the only sensible thing to say about this
is that the number of shards and indices is eventually going to have an
impact on your query times. Say you have 1GB shards and 1 petabyte of
index: that means a single query has to hit about a million shards
to come back with an answer. I imagine that might take a while, even in ES,
and that could be an argument for splitting a cluster. Say you have some
super duper setup with 10 such shards per node: that is a big whopping
cluster with at least 200K servers (accounting for replicas here).
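
The back-of-the-envelope numbers can be checked with a few lines of arithmetic. This sketch assumes decimal units, one replica of every shard, and evenly filled shards; the function name is made up for illustration. With 1 GB shards, 1 PB of index comes to a million primary shards and 200,000 nodes at 10 shards per node; figures of about a hundred thousand shards and 20,000 nodes correspond to 10 GB shards.

```python
def shard_estimate(index_bytes, shard_bytes, shards_per_node, replicas=1):
    """Return (primary shards, total shard copies, nodes needed)."""
    primaries = -(-index_bytes // shard_bytes)   # ceiling division
    copies = primaries * (1 + replicas)          # primaries plus replicas
    nodes = -(-copies // shards_per_node)        # ceiling division
    return primaries, copies, nodes


GB = 10**9
PB = 10**15

# 1 PB of index, 10 shards per node, one replica of every shard.
with_1gb_shards = shard_estimate(PB, 1 * GB, 10)
with_10gb_shards = shard_estimate(PB, 10 * GB, 10)
```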

Of course, just because your data size is measured in petabytes doesn't
mean your index size is going to be anywhere near that. I could imagine
that not storing the data itself in your index would reduce the index
to a much more manageable size.

Anyway, there are so many things to consider here that you are not telling
us that it is impossible to answer this. The only thing I can advise is
that, at this scale, you might just want to get in touch with the ES guys
and get them involved in supporting you directly. That might just save you
tons of money and time spent on misguided infrastructure.


Hello Jilles:

Thank you, this was very helpful. I am planning to create smallish indexes
(with a day's worth of data each) with backup. It did not occur to me that
the older indexes would be read only and not susceptible to split brain, so
this seems like a really good strategy for mitigating potential split brain
issues. My takeaway from your comments, and others that I have read, is that
it is important to experiment with different configurations on my actual
data and see what works. If I get to the point where multiple clusters are
necessary, I don't think that will really be a big problem. Hopefully, I
can convince my managers to pay for support/training from ES directly; I
think this is really good advice.

Thanks Again!
Milad
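
If several clusters do become necessary, the application-level search mentioned earlier in the thread amounts to querying each cluster and merging the results. Below is a minimal sketch of the merge step only, assuming each cluster has already returned its hits sorted by descending score as (score, doc_id) pairs. The cluster names and data are invented, and scores computed by separate clusters are not strictly comparable because their term statistics differ.

```python
import heapq
from itertools import islice


def merge_top_hits(per_cluster_hits, size=10):
    """Merge per-cluster hit lists (each sorted by descending score) into
    one global top-`size` list of (score, cluster, doc_id) tuples."""
    tagged = (
        [(score, cluster, doc_id) for score, doc_id in hits]
        for cluster, hits in per_cluster_hits.items()
    )
    # heapq.merge keeps the inputs' descending order when reverse=True.
    merged = heapq.merge(*tagged, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, size))


hits = {
    "cluster-2013-09": [(3.2, "a"), (1.1, "b")],
    "cluster-2013-10": [(2.7, "c"), (2.5, "d"), (0.4, "e")],
}
top3 = merge_top_hits(hits, size=3)
```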

On Sat, Oct 26, 2013 at 3:30 AM, Jilles van Gurp <jillesvangurp@gmail.com> wrote:

[...]