I will try to answer some points...
On Sun, Jul 31, 2011 at 9:30 PM, TheDeveloper email@example.com:
I'd be looking to store potentially hundreds of documents a second in
the cluster and have that searchable. Most of the searches should be
relatively simple field matches, but most likely include some facet
searching and compound filtering too.
You need to test yourself. The important thing is how complex your documents
will be. Splitting your index into several shards will help indexing
performance. From my experience I can tell you that indexing a few hundred
documents per second is possible (indexing non-trivial documents into an
index with three shards, in parallel, using a hundred threads; the biggest
limiting factor was network throughput, but your setup may vary).
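As a minimal sketch of the setup described above (assuming a local node on localhost:9200; the index name "docs" is made up):

```shell
# Create an index with three primary shards (hypothetical index name).
curl -XPUT 'http://localhost:9200/docs/' -d '{
  "settings": {
    "index": {
      "number_of_shards": 3
    }
  }
}'
```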
To make new documents searchable, the refresh operation has to be performed.
By default it is executed once per second, and it can also be executed
explicitly via the API.
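For example (assuming a local node and a hypothetical index called "docs"):

```shell
# Trigger a refresh explicitly so newly indexed documents become searchable.
curl -XPOST 'http://localhost:9200/docs/_refresh'

# Or tune how often the automatic refresh runs (the default is 1s);
# a longer interval trades search freshness for indexing throughput.
curl -XPUT 'http://localhost:9200/docs/_settings' -d '{
  "index": {"refresh_interval": "5s"}
}'
```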
You can read more about it here:
and more generally here:
Also, you should check the options for making indexing more efficient.
I'm interested to know how ElasticSearch copes with a consistent
barrage of data and a growing index (potentially to very large sizes).
When a new index is created you specify the number of shards, so theoretically
you can design for a large number of shards. However, even with a lot of
shards you can hit the limits of the cluster (once each shard is on a dedicated
machine and cannot grow any more). There is a notion of index aliases to
cope with this effectively: you can dynamically add/remove indices and give
them the same alias to allow searching across them.
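A sketch of that pattern (index and alias names are made up; assumes a local node):

```shell
# Point the alias "logs" at a new index and drop an old one atomically.
curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions": [
    {"add":    {"index": "logs-2011-08", "alias": "logs"}},
    {"remove": {"index": "logs-2011-01", "alias": "logs"}}
  ]
}'

# Searches against the alias hit all indices currently behind it.
curl -XGET 'http://localhost:9200/logs/_search?q=status:error'
```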
To make search more performant you can consider adding more replicas to the
index (assuming you have idle machines for them). The number of replicas is
set at index creation and can also be changed dynamically later.
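Changing the replica count later is a single settings update, for example (hypothetical index name):

```shell
# Bump the replica count of an existing index to two copies per shard.
curl -XPUT 'http://localhost:9200/docs/_settings' -d '{
  "index": {"number_of_replicas": 2}
}'
```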
How much does the size of the index impact the performance of search
queries? Are there any published data/graphs on this, or existing
I am not aware of any publications or graphs. It would be nice to see some.
As for real-world, large-scale implementations you can check:
Additionally, from what I understand, underlying the search indexes is
essentially a distributed k/v system. What is the distribution model
employed here? How are the keys physically sharded across servers? Is
it using consistent hashing methodology, and if so does this mean the
cluster is subject to significant key rebalancing should a node come
You can check the following presentation by Shay Banon:
(the video recording hasn't been released yet, but I believe it will be
available soon; just keep watching this ML)
There are even some O() notations in it.
As you will learn from the presentation, the distribution model is to
partition the index into several shards. Each shard is in fact a standalone,
fully functional Lucene index. Sounds simple, right? Once a document is
indexed (i.e. it is allocated to a particular index shard) it is never
reallocated to a different shard - documents do not move between shards. If
this shard goes down and you do not have any replica of it, then you will get
incomplete search results (you will learn about that from the search
response; you can also learn about it from the admin cluster health API).
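For instance (assuming a local node):

```shell
# Cluster-level view: status is green/yellow/red depending on whether
# all shards (and their replicas) are allocated.
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
```

Every search response also carries a _shards section reporting how many shards were queried successfully and how many failed, which is how you would notice incomplete results.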
Either way I'm playing around and am very impressed with what I've
seen so far.
On Jul 31, 6:46 pm, Lukáš Vlček lukas.vl...@gmail.com wrote:
I think your questions are quite general. Can you share more details about
what exactly you are looking for?
For example, the search results can contain only document IDs of a few top
results, or highlighted text, faceted data (aggregations), etc. Also, there
are several search query types when it comes to distributed search.
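To make that concrete, a request can ask for any combination of those pieces; a sketch (index and field names are made up; assumes a local node):

```shell
# A search that returns only a selected field, highlights matches in the
# body, and computes a terms facet over a "tags" field.
curl -XGET 'http://localhost:9200/docs/_search' -d '{
  "query":     {"term": {"user": "kimchy"}},
  "fields":    ["title"],
  "highlight": {"fields": {"body": {}}},
  "facets":    {"tags": {"terms": {"field": "tags"}}}
}'
```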
As for exact time complexity, I cannot speak for Shay, but I do not think you
will get "exact" numbers here, maybe only high-level estimates. The thing
is that, apart from the search itself, other things are going on behind the
scenes (like Lucene index merges, node data recovery, gateway snapshots,
index refresh ...) and although they happen in the background they do impact
performance, not to mention that you can change several configuration options
on the fly to tune performance for specific situations.
On Sun, Jul 31, 2011 at 7:03 PM, TheDeveloper <geoffwagst...@gmail.com
Can someone give me an idea of the upper limits to which ElasticSearch
is able to scale? In terms of factors like document size, number of
items indexed, and the performance impact at scale.
For example, what kind of time complexity are we looking at for
searching against an increasing number of documents? Is it O(1), O(N), or
logarithmic? How does the number of nodes impact this?
I have some ideas of building some very cool stuff with