ElasticSearch at scale


(TheDeveloper) #1

Can someone give me an idea of the upper limits to which ElasticSearch
is able to scale, in terms of factors like document size, number of
items indexed, and the performance impact at scale?

For example, what kind of time complexity are we looking at for
searching against an increasing number of documents? Is it O(1), O(N), or
logarithmic? How does the number of nodes impact this?

I have some ideas of building some very cool stuff with
ElasticSearch :slight_smile:

Thanks


(Lukáš Vlček) #2

Hi,

I think your questions are quite general. Can you share more details about
what exactly you are looking for?

For example, the search results can contain only the document IDs of a few
top results, or highlighted text, faceted data (aggregations), etc. There
are also several search query types when it comes to distributed search (see
http://www.elasticsearch.org/guide/reference/api/search/search-type.html for
more details).

As for exact time complexity, I cannot speak for Shay, but I do not think
you will get "exact" numbers here, maybe only high-level estimates. The
thing is that, apart from the search itself, other things are going on
behind the scenes (Lucene index merges, node data recovery, gateway
snapshots, index refreshes, etc.), and although they happen in the
background, they do impact performance. On top of that, you can change
several configuration options on the fly to tune performance for specific
situations.

Regards,
Lukas



(TheDeveloper) #3

Thanks Lukas.

I'd be looking to store potentially hundreds of documents a second in
the cluster and have that searchable. Most of the searches should be
relatively simple field matches, but will most likely include some facet
searching and compound filtering too.

I'm interested to know how ElasticSearch copes with a constant
barrage of data and a growing index (potentially to very large sizes).
How much does the size of the index impact the performance of search
queries? Are there any published data/graphs on this, or existing
large-scale implementations?

Additionally, from what I understand, underlying the search indexes is
essentially a distributed k/v system. What is the distribution model
employed here? How are the keys physically sharded across servers? Is
it using a consistent-hashing methodology, and if so, does this mean the
cluster is subject to significant key rebalancing should a node come
up or go down?

Either way I'm playing around and am very impressed with what I've
seen so far.

Geoff


(Lukáš Vlček) #4

Hi,

I will try to answer some points...

On Sun, Jul 31, 2011 at 9:30 PM, TheDeveloper <geoffwagstaff@gmail.com> wrote:

Thanks Lukas.

I'd be looking to store potentially hundreds of documents a second in
the cluster and have that searchable. Most of the searches should be
relatively simple field matches, but will most likely include some facet
searching and compound filtering too.

You need to test this yourself. The important thing is how complex your
documents will be. Splitting your index into several shards will help
indexing performance. From my experience, I can tell you that indexing a
few hundred documents per second is possible (indexing non-trivial
documents into an index with three shards, in parallel, using a hundred
threads; the biggest limiting factor was network throughput, but your
setup may vary).

To make new documents searchable, a refresh operation has to be performed.
By default it is executed once per second. It can also be executed
explicitly via the API.
You can read more about it here:
http://www.elasticsearch.org/guide/reference/api/admin-indices-refresh.html
and more generally here:
http://www.elasticsearch.org/guide/reference/api/index_.html
Also you should check
http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html
for some options on how to make indexing more efficient.
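To get a feel for what a heavy indexing workload against these endpoints could look like, here is a minimal stdlib-only Python sketch that builds a bulk request body plus the refresh call; the index name "logs" and type "event" are made-up placeholders, and sending to a real node at localhost:9200 is left as a comment.

```python
import json

# Hypothetical names for illustration only.
INDEX = "logs"
DOC_TYPE = "event"

def bulk_body(docs):
    """Build a _bulk request body: one JSON action line followed by one
    JSON source line per document, newline-delimited, with a trailing
    newline as the bulk endpoint expects."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps(
            {"index": {"_index": INDEX, "_type": DOC_TYPE, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

# POST this body to http://localhost:9200/_bulk. If the documents must be
# visible immediately, rather than after the ~1 s refresh interval, follow
# up with POST http://localhost:9200/logs/_refresh (empty body).
body = bulk_body([("1", {"msg": "hello"}), ("2", {"msg": "world"})])
```

Batching like this amortizes the per-request overhead, which matters when network throughput is the bottleneck, as described above.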

I'm interest to know how elastic search copes with a consistent
barrage of data and a growing index (potentially to very large sizes).

When a new index is created, you specify the number of shards, so
theoretically you can design for a large number of shards. However, even
with a lot of shards you can hit the limit of the cluster (once each shard
is on a dedicated machine and cannot grow any more). There is a notion of
index aliases to cope with this effectively:
http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html.
As a result, you can dynamically add/remove indices and give them the same
alias to allow searching across them.
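As a rough sketch of the alias mechanism, the following builds the request body for the _aliases endpoint, which applies a list of add/remove actions; the alias and index names (time-based "events-*" indices) are hypothetical examples, not anything from the thread.

```python
import json

def alias_actions(alias, add_indices=(), remove_indices=()):
    """Build the body for POST /_aliases: a list of add/remove actions
    that the cluster applies together, so searches against the alias
    switch over without a gap."""
    actions = []
    for idx in remove_indices:
        actions.append({"remove": {"index": idx, "alias": alias}})
    for idx in add_indices:
        actions.append({"add": {"index": idx, "alias": alias}})
    return json.dumps({"actions": actions})

# e.g. roll a new monthly index under a single search alias:
body = alias_actions("events",
                     add_indices=["events-2011-08"],
                     remove_indices=["events-2011-07"])
```

Searches then target the "events" alias, while indexing targets the concrete per-period index, which is one common way to grow past a single index's shard limit.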

To make search more performant, you can consider adding more replicas to
the index (assuming you have idle machines for them). The number of
replicas is set at index creation and can also be changed dynamically
later.
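Changing the replica count dynamically goes through the update-settings API mentioned earlier; a minimal sketch of the request body (sent as PUT to /<index>/_settings on a live cluster) could look like this:

```python
import json

def replica_settings(n):
    """Body for PUT /<index>/_settings. Unlike number_of_shards, which is
    fixed at index creation, number_of_replicas can be changed on a live
    index."""
    return json.dumps({"index": {"number_of_replicas": n}})

body = replica_settings(2)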

How much does the size of the index impact the performance of search
queries? Are there any published data/graphs on this, or existing
large-scale implementations?

I am not aware of any publications and graphs. It would be nice to see some.
As for real world large scale implementations you can check
http://www.elasticsearch.org/users/

Additionally, from what I understand, underlying the search indexes is
essentially a distributed k/v system. What is the distribution model
employed here? How are the keys physically sharded across servers? Is
it using consistent hashing methodology, and if so does this mean the
cluster is subject to significant key rebalancing should a node come
up/go down.

You can check the following presentation by Shay Banon:
http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/elasticsearch-bbuzz2011.pdf
(the video recording hasn't been released yet, but I believe it will be
available soon; just keep watching this ML).
There are even some O() notations :slight_smile:

As you will learn from the presentation, the distribution model is to
partition the index into several shards. Each shard is in fact a
standalone, fully functional Lucene index. Sounds simple, right? Once a
document is indexed (i.e. allocated to a particular index shard), it is
never reallocated to a different shard; documents do not move between
shards. If a shard goes down and you do not have any replica of it, then
you will get incomplete search results (you will learn about that from the
search response, and you can also learn about it from the cluster health API:
http://www.elasticsearch.org/guide/reference/api/admin-cluster-health.html).
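The shard-allocation scheme above can be sketched in a few lines. This is only an illustration of the hash-modulo idea, assuming an MD5-based hash for determinism; ElasticSearch's actual hash function differs, but the consequence is the same: because the shard count is fixed at index creation, the document-to-shard mapping never changes when nodes join or leave.

```python
import hashlib

def shard_for(doc_id, num_shards):
    """Pick a shard for a document from a hash of its id. The mapping
    depends only on the id and the (fixed) shard count, so documents are
    never rebalanced between shards; when nodes come and go, whole
    shards (standalone Lucene indices) are relocated instead."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

# The same id always lands on the same shard:
shard = shard_for("doc-42", 5)
```

This is why the cluster avoids the key-rebalancing problem you asked about: rebalancing moves coarse-grained shards between nodes, not individual keys.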


(system) #5