Agreed on 50TB meaning a lot of different things. 50TB of .gz raw data is
a lot different from 50TB of big fluffy JSON, which is a lot different from
50TB of RAID-10, replicated indices. I've started telling our internal users
that I can give them any number they want for "size", because it can vary
by two orders of magnitude based on where it is in the pipe. If they want
a number that actually means something reasonably consistent, I'll give
them number of docs.
We have around 30 billion documents online in ES; for us, a billion docs is
a good per-server limit. Each server uses about 1.5TB of disk for indices;
double that for a replica, and triple it if you want to be able to reindex
from HDFS without taking existing data offline until you're done and can
switch aliases over. Double it again if you aren't using compression. (Use
compression.) ES does very well at that sort of scale (beats the pants off
Solr, at any rate), although it will occasionally break in exotic ways, and
it can be hard to figure out which query is problematic if you're throwing
a lot of traffic at it. Our servers all have (and need) 96GB of RAM, split
evenly between Java heap and OS cache, but your mileage will vary a lot on
that based on your specific use case.
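For anyone who wants to plug in their own numbers, here is a rough sketch of
the arithmetic above. The function names are made up for illustration, and it
assumes one reading of the multipliers: "double" and "triple" both apply to
the base 1.5TB figure (so 2 copies with a replica, 3 copies with reindex
headroom), with a further 2x if compression is off. Treat the outputs as
back-of-envelope estimates, not sizing guidance.

```python
import math

# Figures quoted in this post; every one of them is workload-specific.
DOCS_PER_SERVER = 1_000_000_000   # "a billion docs is a good per-server limit"
BASE_INDEX_TB = 1.5               # disk used by indices on one server


def servers_needed(total_docs, docs_per_server=DOCS_PER_SERVER):
    """Number of servers needed to hold total_docs at the per-server limit."""
    return math.ceil(total_docs / docs_per_server)


def disk_per_server_tb(copies=1, compressed=True, base_tb=BASE_INDEX_TB):
    """Estimated disk per server in TB.

    copies: 1 = primary only, 2 = with a replica,
            3 = with room to reindex alongside live data (assumed reading).
    compressed: if False, the estimate is doubled.
    """
    tb = base_tb * copies
    return tb if compressed else tb * 2


if __name__ == "__main__":
    print("servers for 30 billion docs:", servers_needed(30_000_000_000))
    for copies in (1, 2, 3):
        for compressed in (True, False):
            print(copies, "copies, compressed =", compressed, "->",
                  disk_per_server_tb(copies, compressed), "TB per server")
```

With these inputs the spread between the smallest and largest per-server
figure is 6x before you even count raw versus indexed size, which is why
document count is the more stable number to quote.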
On Tuesday, October 30, 2012 8:36:18 PM UTC-5, Otis Gospodnetic wrote:
Some of Sematext's clients use ES on that sort of scale. 50TB of data is
not precise enough if you are referring to raw data - who knows how much of
that will be indexed, how much stored, etc. 16GB RAM sounds lowish, unless
indices are not more than 50GB, say. But that is also not an accurate
statement, because even a 100GB index may be fine if you only ever hit it
with a handful of queries. Or if you use routing well. Or if you are OK with
high latency, of course.
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html
On Tuesday, October 30, 2012 6:54:19 AM UTC-4, Robin Verlangen wrote:
I was wondering whether there are users that deploy ES on massive scale.
For example 50TB of data. This is currently stored in HDFS and searching is
possible with Map/Reduce jobs. However something like ES would be really
Is it possible? If so, what kind of cluster should you think of?
Currently we run an eight-node cluster with 12x1TB disks per server, 16GB
RAM and dual quad-core CPUs.