First of all, it will help with your sizing both your data and the number
of nodes you will need. As to systems that don't have a limit to the number
of docs, it really depends on how the data flows. In theory, my blog has no
limit on the number posts it will have, but I am not going to post 1
billion posts on it.
First, the most common case. You run your tests, and see that a single
shard can hold 100 million docs. In your wildest expectation, you are not
going to cross 1 billion. Thats simple, have an index with 10 shards,
enough hardware, and be done with it. Thats where most system fall, yet
many "think" they need to support unlimited data, and over-engineer.
For continuos stream of data, like log files, you can use the ability to
create indices on the fly, so you can have an index per week / month and so
on. A system that index the tweets that come along can use that as well.
You can easily search over several indices.
Another type is data that can be highly partitioned. User based data for
example. You can use the username to do the routing (both when indexing and
searching), and create a considerable number of shards. The number of
shards does not affect search - on limited HW - since you don't search on
all of them, you route based on user name. For example, you can create a
200 shard system on 4 nodes and that will work fine.
On Fri, Jan 13, 2012 at 7:11 AM, Nick Hoffman firstname.lastname@example.org wrote:
On Thursday, 12 January 2012 18:17:23 UTC-5, kimchy wrote:
Yea, deciding the number requires a bit of work. What I usually recommend
people is to create a single index with a single shard, on a single
instance and "shove" data to it. Find that upper limit based on your data
set and search requirements. You should get a number out of it. Something
like N docs, or X size. Then, you can extrapolate from there.
Interesting. Once you find the max number of docs that can be stored in a
single instance, for example, what exactly do you extrapolate from that
I guess some applications will have a finite number of docs indexed, which
would enable you to calculate how many shards you'll need when you reach
that finite number.
However, if there's no upper limit to the number of documents that you'll
index, I'm not sure how knowing how many docs can be stored in an instance