Some ES stats (at yfrog.com)

Hello all, I thought I would share some stats on how we are using
Elasticsearch.

At yfrog, we are indexing a corpus of tweets that users post along with
photos uploaded to Twitter from mobile and web. Our total index is
about 950 million docs. We run 500 shards across 12 servers; most are
8-core x 16 GB boxes, and some are 12-core x 32 GB. Routing lets us
provide per-user timeline search, so that each user's data always lives
in a single shard. Here is a screenshot, btw:
http://a.yfrog.com/img703/8096/xil.png
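
In case it helps anyone, here is a minimal sketch of what that routing looks like on the wire. The index name, field names, and user IDs are just examples, and the endpoints follow the current REST API rather than exactly what we run:

```python
# Minimal sketch of per-user routing (index/field names are made up).
# Indexing and searching with the same ?routing= value keeps a user's
# tweets in one shard and lets searches hit only that shard.
import json
import urllib.request

ES = "http://localhost:9200"


def _request(method, path, body=None):
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        ES + path, data=data, method=method,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def index_tweet(tweet_id, user, text):
    # Route the document by user so the whole timeline lands in one shard.
    return _request("PUT", f"/tweets/_doc/{tweet_id}?routing={user}",
                    {"user": user, "text": text})


def search_timeline(user, terms):
    # The same routing value means only that user's shard is queried.
    return _request("POST", f"/tweets/_search?routing={user}",
                    {"query": {"bool": {
                        "filter": [{"term": {"user": user}}],
                        "must": [{"match": {"text": terms}}]}}})
```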

Each shard is about 1 GB, which puts the total data at about 1 TB, half
of which is replica data (i.e. 1 replica). The other interesting part
is that each server runs with 4x2TB disks, with ZFS over them. So far
we have had no issues with ZFS on Linux, and the servers have shown a
definite performance improvement.

We are currently indexing 3 million new docs per day (new tweets). So
far the main challenge has been keeping the file descriptor count under
control; we are basically running with ulimit -n 256000. Merging each
shard down to an appropriate number of segments is always a must, but
that significantly taxes the CPU, and slows down bulk indexing should
we ever try to re-index the whole corpus.
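
For anyone who wants to poke at the same things, a rough sketch of the relevant calls is below. The index name and segment target are made up, and the endpoint names follow recent releases (where optimize became _forcemerge); the idea is the same either way:

```python
# Sketch: check per-node file descriptor usage, then force-merge an index
# down to a fixed number of segments per shard (names are illustrative).
import json
import urllib.request

ES = "http://localhost:9200"


def fd_usage():
    # Node process stats report open vs. maximum file descriptors.
    with urllib.request.urlopen(ES + "/_nodes/stats/process") as resp:
        stats = json.load(resp)
    return {
        node["name"]: (node["process"]["open_file_descriptors"],
                       node["process"]["max_file_descriptors"])
        for node in stats["nodes"].values()
    }


def merge_down(index, max_segments=5):
    # _forcemerge is the current name; older releases exposed it as _optimize.
    req = urllib.request.Request(
        ES + f"/{index}/_forcemerge?max_num_segments={max_segments}",
        method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


print(fd_usage())
merge_down("tweets", max_segments=5)
```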

On the other hand, as you can see from the screenshot, the read
(search) performance is just awesome; you can try some searches
yourself at http://yfrog.com/user/TechCrunch/profile

-Jack

Hiya Jack

On Sun, 2012-02-19 at 21:58 -0800, Jack Levin wrote:

Hello all, I thought I would share some stats on how we are using
Elasticsearch.

great post - thanks for sharing.

The other interesting part
is that each server runs with 4x2TB disks, with ZFS over them.
So far we have had no issues with ZFS on Linux, and the servers have
shown a definite performance improvement.

That's very interesting. I hadn't seen that native ZFS was available for
Linux. A link for the interested: http://zfsonlinux.org/

clint

Thanks for sharing this.

It would be nice to add it to the Elasticsearch website.

David :wink:
@dadoonet


Jack, thanks a lot for sharing the data! It's really important for people to share this type of info with Elasticsearch users.

Regarding the segments: by default, Elasticsearch uses the tiered merge policy (see the index merge settings documentation). The important thing to remember about it is that it puts an (estimated) maximum size bound on a segment. The setting is max_merged_segment, and it defaults to 5gb (same as in Lucene). This can cause many segments to be created for a large index. I have been going back and forth on possibly increasing this default to a higher value (which will cause more merging, but fewer resources used - file descriptors and memory - and faster searches).
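
To illustrate, bumping it on a live index is just a dynamic settings update; a rough sketch (the index name and the 15gb value are only examples):

```python
# Sketch: raise the tiered merge policy's max segment size on a live index
# (index name and size are illustrative).
import json
import urllib.request

ES = "http://localhost:9200"

def set_max_merged_segment(index, size="15gb"):
    body = json.dumps(
        {"index.merge.policy.max_merged_segment": size}).encode()
    req = urllib.request.Request(
        ES + f"/{index}/_settings", data=body, method="PUT",
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

set_max_merged_segment("tweets", "15gb")
```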

One thing you can use to gauge it is the segments API, which returns detailed data on the segments used in each shard of an index. Lukas has been doing an amazing job with BigDesk; I hope we will be able to improve on it and have it provide "index"-level stats on top of the node-level stats it focuses on today. One of the things we will do then is provide a nice visualization of the segments each shard has, so people can more easily decide what values to set (all of these can be set in real time on a live index), or when to issue optimize calls.
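
As a small example of what I mean, something like this pulls a per-shard segment count out of the segments API (index name is made up; the response layout is roughly as shown):

```python
# Sketch: count segments per shard copy via the index segments API
# (index name is illustrative).
import json
import urllib.request

ES = "http://localhost:9200"

def segments_per_shard(index):
    with urllib.request.urlopen(ES + f"/{index}/_segments") as resp:
        data = json.load(resp)
    counts = {}
    # "shards" maps shard number -> list of copies (primary + replicas),
    # each copy carrying a dict of its Lucene segments.
    for shard, copies in data["indices"][index]["shards"].items():
        counts[shard] = [len(copy["segments"]) for copy in copies]
    return counts

print(segments_per_shard("tweets"))
```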


Hi Jack,

Thanks for sharing! I'm curious: can you provide any throughput
numbers, i.e., how many concurrent user requests you're able to serve
from this cluster, and rough latency numbers? Do you use the geospatial
capabilities at all?

Thanks,
Nick
