Hello all, I thought I would share some stats on how we are using
At yfrog, we are indexing corpus of tweets that users use in
conjunction with photos uploaded to twitter from mobile and web. Our
total index is about 950 million docs. We run 500 shards across 12
servers, which are 8 core x 16GB, and some are 12 core x 32GB boxes.
Routing allows us to provide timeline search (per user), so that each
user's data is always in one shard. Here is a screenshot btw:
Each shard is about 1G large, which makes total data to be about 1TB
with 50% of being replicated (e.g. 1 replica). The other interesting
parts is that each server runs with 4x2TB disks, with ZFS over them.
So far we have had no issues with ZFS on linux, and servers have had
definite performance improvement.
We are currently indexing 3ml new docs per day (new tweets). So far
the challenges have been to keep file descriptors low, we are
basically running with ulimit -n 256000. Merging down to appropriate
number of segments under shard is always a must, but that
significantly taxes CPU, and slows down bulk indexing should we ever
try to re-index the whole corpus.
On the other hand, as you can see from the screenshot the read search
performance is just awesome, you can try some searches yourself at