Elastic Search Architecture at petabyte scale

Milad_Fatenejad · October 25, 2013, 12:51pm

Hello:

We have a very large document archive (multipetabyte) and are considering
leveraging Elasticsearch to provide more elastic indexing/search
capabilities. We archive large amounts of data every day, so I was
considering proposing an architecture where we create a new index for every
day, then use aliases to combine searches across whichever indexes we
like.
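
As a rough illustration of the index-per-day plus alias idea, here is a
minimal Python sketch that builds daily index names and the JSON body for an
alias update. The "archive" prefix, the alias name, and the helper names are
hypothetical; the body follows the actions/add shape accepted by the
Elasticsearch _aliases endpoint, and nothing here talks to a cluster.

```python
from datetime import date, timedelta


def daily_index_name(day: date, prefix: str = "archive") -> str:
    """Name of the index holding one day of data, e.g. archive-2013.10.25."""
    return f"{prefix}-{day:%Y.%m.%d}"


def alias_actions(alias: str, first_day: date, num_days: int,
                  prefix: str = "archive") -> dict:
    """Build an _aliases request body pointing `alias` at a run of daily indexes."""
    days = (first_day + timedelta(days=i) for i in range(num_days))
    return {
        "actions": [
            {"add": {"index": daily_index_name(d, prefix), "alias": alias}}
            for d in days
        ]
    }


# Hypothetical alias covering three consecutive daily indexes.
body = alias_actions("archive-last-3-days", date(2013, 10, 23), 3)
```

The resulting dictionary would be sent as JSON in a POST to _aliases;
searches against the alias then fan out over the listed daily indexes.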

I have two questions:

1. I have read some comments that recommend not using a single ES cluster
   for petabyte levels of data; that it is better to create separate clusters
   at this scale (e.g. a separate cluster for each month). If that is the
   case, are there capabilities for doing cross-cluster search/aggregation of
   results, or would that be implemented by the application?
2. I have read mixed information about the split brain issue. Because our
   archive is so large, we cannot afford to reindex large portions of it, so
   the split brain issue is a significant concern. On the one hand, I have
   read that with proper configuration, split brain is not a problem. I have
   also read that even with proper configuration it is still possible to have
   split brain. So let me pose the question this way: Suppose you would be
   fired if you ever had to reindex more than 5 nodes in your cluster at
   once...would you still use Elasticsearch given the split brain issue
   (assume perfect configuration, i.e. that the split was not caused by a
   configuration error, but that network disruptions between the nodes are
   possible)? I am just trying to gauge how serious a problem this is for ES.

Thanks!
Milad

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · October 25, 2013, 11:07pm

1. Can you give pointers to the recommendation to use more than one cluster
   for petabyte scale? It is hard to believe, because you are right, there is
   no support for cross-cluster indexing/search (you have to build your own
   solution at the app level with many clients).
2. From my understanding, zen discovery (the default discovery module
   in ES) requires a sound network, and on that basis, it can detect faulty
   nodes by sending multicast or unicast pings. If the network is partially
   unavailable, it becomes challenging to detect missing communication
   between nodes. But there are alternatives, like a ZooKeeper-based discovery
   plugin: https://github.com/sonian/elasticsearch-zookeeper
   ZooKeeper is known to be a robust implementation for consensus.
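
For reference, the "proper configuration" usually meant for zen discovery is
a master quorum: discovery.zen.minimum_master_nodes set to a majority of the
master-eligible nodes. The sketch below only shows that arithmetic. The
function names are made up for illustration, and a quorum reduces the risk of
split brain during a network partition rather than ruling it out.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority quorum: (master-eligible nodes / 2) + 1, using integer division."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(visible: int, master_eligible: int) -> bool:
    """True if one side of a partition sees enough master-eligible nodes."""
    return visible >= minimum_master_nodes(master_eligible)


# Five master-eligible nodes split 3 / 2 by a network partition:
# only the side with three nodes reaches the quorum of 3.
quorum = minimum_master_nodes(5)
majority_side_ok = can_elect_master(3, 5)
minority_side_ok = can_elect_master(2, 5)
```

With an even number of master-eligible nodes (say 4, quorum 3) a 2 / 2 split
leaves neither side able to elect a master, which is one reason odd counts
are commonly suggested.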

Jörg


Jilles_van_Gurp · October 26, 2013, 8:30am

Split brain would only affect the indices you are writing to, and if you
create a new index every day, you'd never lose more than a day's worth of
indexing (assuming you have some backup strategy for the rest). All your
other indices are read-only.

Regarding the data size, I think the only sensible thing to say about this
is that the number of shards and indices is eventually going to have an
impact on your query times. Say you have 10GB shards and 1 petabyte of
index: that means a single query has to hit hundreds of thousands of shards
to come back with an answer. I imagine that might take a while, even in ES,
and that could be an argument for splitting a cluster. Say you have some
super duper setup with 10 such shards per node: that is a big whopping
cluster with at least 20K servers (accounting for replicas here).
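
A back-of-the-envelope version of this kind of estimate, assuming decimal
units (1 PB = 10^15 bytes), one replica per primary shard, and 10 shard
copies per node. All three are assumptions, and the function name is
hypothetical. The result is very sensitive to shard size, so the sketch
evaluates both 1 GB and 10 GB shards.

```python
GB = 10**9
PB = 10**15


def cluster_estimate(index_bytes: int, shard_bytes: int,
                     replicas: int = 1, copies_per_node: int = 10) -> dict:
    """Rough shard and node counts for an index of the given size."""
    primaries = -(-index_bytes // shard_bytes)       # ceiling division
    shard_copies = primaries * (1 + replicas)        # primaries plus replicas
    nodes = -(-shard_copies // copies_per_node)      # ceiling division
    return {"primaries": primaries, "shard_copies": shard_copies, "nodes": nodes}


small_shards = cluster_estimate(1 * PB, 1 * GB)
# {'primaries': 1000000, 'shard_copies': 2000000, 'nodes': 200000}

large_shards = cluster_estimate(1 * PB, 10 * GB)
# {'primaries': 100000, 'shard_copies': 200000, 'nodes': 20000}
```

Either way a query across the whole archive touches a very large number of
shards, which is the point being made about query times.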

Of course, just because your data size is measured in petabytes doesn't
mean your index size is going to be anywhere near that size. I could
imagine that not storing the data in your index would vastly reduce the
index to a much more manageable size.

Anyway, there are so many things to consider here that you are not telling
us that it is impossible to answer this. The only thing I could advise is
that at this scale, you might just want to get in touch with the ES guys
and get them involved in supporting you directly. That might just save you
tons of money and time spent on misguided infrastructure.


Milad_Fatenejad · October 26, 2013, 4:17pm

Hello Jilles:

Thank you, this was very helpful. I am planning to create smallish indexes
(with a day's worth of data) with backup. It did not occur to me that the
older indexes would be read-only, and not susceptible to split brain, so
this seems like a really good strategy for mitigating potential split brain
issues. My takeaway from your comments, and others I have read, is that
it is important to experiment with different configurations with my actual
data and see what works. If I get to the point where multiple clusters are
necessary, I don't think that will really be a big problem. Hopefully, I
can convince my managers to pay for support/training from ES directly; I
think this is really good advice.

Thanks Again!
Milad

