Massive-scale Elasticsearch

Hi there,

I was wondering whether there are users that deploy ES on a massive scale,
for example 50TB of data. The data is currently stored in HDFS and can be
searched with Map/Reduce jobs, but something like ES would be really
sweet.

Is it possible? If so, what kind of cluster should you think of? Currently
we run an eight-node cluster with 12x1TB disks per server, 16GB RAM, and
dual quad-core CPUs.

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl

http://goo.gl/Lt7BC


--

Hello Robin,

Yes, there are people using ES on a massive scale. Take a look at this
video for one such use case:

Regarding hardware requirements, it depends on a lot of factors: what your
data looks like, what your queries look like, whether you want to search
across all of your data or only on certain fields, and so on.
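To make the "only on certain fields" point concrete, here is a minimal
sketch of a mapping that keeps some fields out of the index entirely,
using the REST API of the ES versions current at the time. The index,
type, and field names are made up, and it assumes a local test node:

```python
import json
import requests

ES = "http://localhost:9200"  # assumed local test node

# Only selected fields are made searchable; "payload" is kept in _source
# but not indexed at all, which keeps the index and its RAM footprint
# smaller.
mapping = {
    "mappings": {
        "event": {  # hypothetical type name
            "properties": {
                "timestamp": {"type": "date"},
                "message":   {"type": "string"},
                "payload":   {"type": "string", "index": "no"},
            }
        }
    }
}

resp = requests.put(ES + "/logs-test", data=json.dumps(mapping))
print(resp.json())
```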

I'd start with a fraction of those 50TB on one (or a few) test machines
and do some performance testing to see how it goes. Then you'll have a
much better estimate.
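One hedged sketch of such a test, using the _bulk API to time indexing
throughput on a representative slice of the data (index and type names
carried over from the hypothetical mapping above):

```python
import json
import time
import requests

ES = "http://localhost:9200"
INDEX = "logs-test"  # hypothetical test index

def bulk_index(docs):
    """Index a batch of docs via the _bulk API; return seconds taken."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": INDEX, "_type": "event"}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"  # _bulk requires a trailing newline
    start = time.time()
    requests.post(ES + "/_bulk", data=body)
    return time.time() - start

# Feed in a representative sample of the real data and extrapolate.
sample = [{"timestamp": "2012-10-30T12:00:00", "message": "test %d" % i}
          for i in range(1000)]
print("indexed 1000 docs in %.2fs" % bulk_index(sample))
```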

And if you want to monitor your cluster while testing, there are quite a
few options. Of course I'd recommend ours :)

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

On Tue, Oct 30, 2012 at 12:54 PM, Robin Verlangen robin@us2.nl wrote:

[...]

--

Hello Robin,

Some of Sematext's clients use ES on that sort of scale. 50TB of data is
not precise enough if you are referring to raw data: who knows how much of
that will be indexed, how much stored, etc. 16GB RAM sounds lowish, unless
indices stay under, say, 50GB. But that is also not an accurate statement,
because even a 100GB index may be fine if you only ever hit it with a
handful of queries. Or if you use routing well. Or if you are OK with
high latency, of course :)
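"Use routing well" here means sending documents and queries for the same
key to the same shard, so a query touches one shard instead of all of
them. A minimal sketch (index, type, and field names are assumptions):

```python
import json
import requests

ES = "http://localhost:9200"

# Route all documents for one user to the same shard...
doc = {"user": "robin", "message": "hello"}
requests.put(ES + "/logs-test/event/1?routing=robin", data=json.dumps(doc))

# ...and route the query the same way, so only that one shard is
# searched instead of every shard in the index.
query = {"query": {"term": {"user": "robin"}}}
resp = requests.post(ES + "/logs-test/_search?routing=robin",
                     data=json.dumps(query))
print(resp.json()["hits"]["total"])
```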

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

On Tuesday, October 30, 2012 6:54:19 AM UTC-4, Robin Verlangen wrote:

[...]

--

Agreed on 50TB meaning a lot of different things. 50TB of .gz raw data is
a lot different from 50TB of big fluffy JSON, which in turn is a lot
different from 50TB of RAID-10, replicated indices. I've started telling
our internal users that I can give them any number they want for "size",
because it can vary by two orders of magnitude based on where it is in the
pipe. If they want a number that actually means something reasonably
consistent, I'll give them the number of docs.
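For what it's worth, the doc count is a one-liner against the REST API
(the index name is just an example):

```python
import requests

# _count returns the number of documents in an index: the one "size"
# metric that stays comparable across raw data, JSON, and replicated
# indices.
resp = requests.get("http://localhost:9200/logs-test/_count")
print(resp.json()["count"])
```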

We have around 30 billion documents online in ES; for us, a billion docs is
a good per-server limit. Each server uses about 1.5TB of disk for indices;
double that for a replica, and triple it if you want to be able to reindex
from HDFS without taking existing data offline, switching aliases over once
you're done. Double it again if you aren't using compression. (Use
compression.) ES does very well at that sort of scale (beats the pants off
Solr, at any rate), although it will occasionally break in exotic ways, and
it can be hard to figure out which query is problematic if you're throwing
a lot of traffic at it. Our servers all have (and need) 96GB of RAM, split
evenly between Java heap and OS cache, but your mileage will vary a lot on
that based on your specific use case.
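As a rough sketch of the knobs behind the heap and compression advice on
the ES versions of that era (exact setting names varied by version, so
treat these as assumptions to verify):

```python
import json
import requests

ES = "http://localhost:9200"

# Heap: set in the environment before starting the node, e.g.
# ES_HEAP_SIZE=48g on a 96GB box, leaving the other half of RAM to the
# OS page cache.

# Compression: on pre-1.0 ES, _source compression was a mapping option.
mapping = {
    "mappings": {
        "event": {  # hypothetical type name
            "_source": {"compress": True},  # compress the stored _source
            "properties": {
                "message": {"type": "string"}
            }
        }
    }
}
requests.put(ES + "/logs-test", data=json.dumps(mapping))
```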

On Tuesday, October 30, 2012 8:36:18 PM UTC-5, Otis Gospodnetic wrote:

[...]

--

Thank you all for the information. The number I picked was an estimate
based on a forecast of growth. We will be using daily indices and, most of
the time, only searching the last couple of days. I think ES will be a
good choice for this, as it seems to scale pretty well!
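Roughly, the daily-indices pattern would look like the sketch below (the
index naming scheme here is illustrative, not necessarily what we will
use):

```python
import json
from datetime import date, timedelta
import requests

ES = "http://localhost:9200"

# One index per day, e.g. logs-2012.11.07. A search over "the last
# couple of days" then only touches those indices, not the whole
# dataset.
def recent_indices(days=2, today=None):
    today = today or date.today()
    return ",".join("logs-%s" % (today - timedelta(n)).strftime("%Y.%m.%d")
                    for n in range(days))

query = {"query": {"match_all": {}}}  # placeholder query
resp = requests.post(ES + "/%s/_search" % recent_indices(),
                     data=json.dumps(query))
print(resp.json())
```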

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl

http://goo.gl/Lt7BC


On Sat, Nov 3, 2012 at 6:14 AM, Chuck McKenzie redchuck@gmail.com wrote:

[...]

--


Robin Verlangen wrote:

I was wondering whether there are users that deploy ES on
massive scale. For example 50TB of data. This is currently
stored in HDFS and searching is possible with Map/Reduce
jobs. However something like ES would be really sweet.

[...]

Thank you all for the information. The number I picked was an
estimate based on a forecast of growth. We will be using daily
indices and most of the time only using the last couple of days for
searching. I think ES will be a good choice for this as it seems to
scale pretty well!

Just to add a data point, we've comfortably run 60-100TiB clusters on
15 m1.xlarge (16GB RAM) EC2 nodes with 8- or 6-stripe, 4-8TiB EBS
volumes. That's with hundreds of 200GiB shards per node.
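For context, shard count is fixed at index creation time, so at this
scale it is worth setting explicitly per index. A sketch with placeholder
numbers, not our actual layout:

```python
import json
import requests

ES = "http://localhost:9200"

# Shard and replica counts are set when the index is created; the
# numbers below are placeholders, not the actual cluster settings.
settings = {
    "settings": {
        "index": {
            "number_of_shards": 20,
            "number_of_replicas": 1
        }
    }
}
requests.put(ES + "/logs-2012.11.07", data=json.dumps(settings))
```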

ES can easily handle the indexing and storage. Optimizing searching,
as others have pointed out, is usually where you have to spend time
getting it right.

-Drew

--

Hi Drew,

Thank you for reinforcing the confidence I've gained in ES. It seems like
exactly the right tool for what we need. If you want to search through
such volumes, it's obvious that you'll need some hardware!

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl

http://goo.gl/Lt7BC


On Wed, Nov 7, 2012 at 3:26 PM, Drew Raines aaraines@gmail.com wrote:

[...]

--
