Hi,
Allow me to expand a bit on the answer above:
The number of shards per index in elasticsearch is fixed. Each shard
has 0 or more replicas. When an index is created, all shards are allocated
across the nodes (so you might end up with shard 1 and shard 2 allocated to the
same node). A shard and its replica are never allocated to the same node.
Your maximum scale-out size (per index) is determined by
number_of_shards * (number_of_replicas + 1). This means that with the default
settings of 5 shards, each with one replica, you will reach the scaling
limit of that specific index with 10 machines.
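To make the arithmetic concrete, here is a minimal sketch (the function name is mine, not an Elasticsearch API):

```python
def max_scale_out(number_of_shards, number_of_replicas):
    """Maximum number of nodes an index can usefully spread across:
    each shard copy (primary or replica) can live on its own node."""
    return number_of_shards * (number_of_replicas + 1)

# Default settings: 5 shards, each with 1 replica -> 10 machines
print(max_scale_out(5, 1))  # 10
```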
Regarding sizing, that's a good question. Let's start with the number of
shards. The more shards you have, the better indexing TPS you will get
(assuming you have enough machines for them to spread properly). More shards
also means more shards to "hit" when you do a search (since search
is distributed across shards). This can be a good thing, since if you have
heavy search operations you distribute the load, though the fact that more
shards need to process a search request might result in higher latency
(note, async IO is used for this, so no blocking threads or anything like
that).
More replicas increase your high-availability factor, and
since shard replicas participate in search requests, they should spread the load
of search operations. On the other hand, when indexing, it means that the
indexing of a document will happen on a shard and all its replicas. Note,
though, that replication is done in parallel, so the difference between 1
replica and 2 replicas is negligible (assuming you have enough machines to
spread the shards).
The number of shards is fixed. The ability to change the number of
shards will probably not happen in the near future; it is problematic,
since with how things work (and with how search engines in general work), it
would mean completely reindexing a substantial part of your data. Changing the
number of replicas at runtime is something that can certainly be
implemented and is planned.
To alleviate the problem of the number of shards being fixed, one of the
benefits of elasticsearch is that the number of indices is not.
And, since you can search over more than one index, you can devise a system
where indices are added dynamically. For example, for a log aggregation
application, you can decide to create an index per month. This allows you to
scale out indefinitely.
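As a sketch of the index-per-month idea (the helper names and the "logs-YYYY-MM" naming convention are my assumptions, not anything elasticsearch prescribes):

```python
from datetime import date

def monthly_index(d: date) -> str:
    # One index per month, e.g. "logs-2010-07"
    return f"logs-{d.year:04d}-{d.month:02d}"

def indices_for_range(start: date, end: date) -> list:
    """Index names covering [start, end]. These can be joined with
    commas in a multi-index search, e.g.
    /logs-2010-06,logs-2010-07/_search"""
    names = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        names.append(f"logs-{y:04d}-{m:02d}")
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return names

print(indices_for_range(date(2010, 6, 15), date(2010, 8, 1)))
# ['logs-2010-06', 'logs-2010-07', 'logs-2010-08']
```

A new month simply means creating one more index; old months can be dropped wholesale by deleting their index, which is far cheaper than deleting individual documents.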
-shay.banon
On Sun, Jul 25, 2010 at 12:15 AM, James Cook <jcook@tracermedia.com> wrote:
Thanks for that information.
I am using an application architecture where the application does all
Create, Update, Delete functionality via an IMDB, and uses Elastic Search
for queries on anything other than the PK; it also serves as the persistence
mechanism for the IMDB.
+-----------------------------------------------------------+
| Tomcat Application Server |
| +------------------------------------------------------+ |
| | Web App | |
| +------------------------------------------------------+ |
| | (Create, PK Lookup, | | |
| V and Delete) V | (Queries for |
| +-------------------------------+ | objects using |
| | Hazelcast IMDB (Autocluster) | | parameters |
| +-------------------------------+ | other than |
| | (write behind) | primary key) |
| V V |
| +------------------------------------------------------+ |
| | Elastic Search (Autocluster) | |
| +------------------------------------------------------+ |
+-----------------------------------------------------------+
This is what each of my EC2 instances looks like, and they are load
balanced using AWS. Scalr monitors the load on each server to determine when
instances will be created or destroyed.
Scalability testing is scheduled for three weeks from now, so it will be
interesting to see this functioning in the wild.
I'd be very interested in others' "real world" experiences using ES on
EC2.
-- jim
On Sat, Jul 24, 2010 at 3:55 PM, Berkay Mollamustafaoglu <mberkay@gmail.com> wrote:
James,
I'll give it a go to get things started
- How many shards and replicas should I choose for this environment?
There is no way to answer this generically; it depends on how many
documents you need to index, how fast you need to index them, how often they
change, how many users will execute queries, etc. As a general rule of
thumb, more shards increase indexing (write) performance, and more
replicas increase search/read performance. Having a higher number of machines
also increases performance. ES adjusts the distribution of shards and
replicas dynamically as you add more machines.
- Shard count is currently fixed (it cannot be changed once an index is
created). Having a very high number of shards introduces overhead, so it
may not be a good idea. Another option is to create a new index with a
higher number of shards when you need to and reindex the data into it. You can
use index aliases to switch to the new index once reindexing is complete, etc.
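A sketch of that alias switch (the index names here are placeholders; the body follows elasticsearch's _aliases actions format, as I understand it):

```python
import json

# Atomically repoint the "logs" alias from the old index to the new one.
# Sent as the body of: POST http://localhost:9200/_aliases
actions = {
    "actions": [
        {"remove": {"index": "logs-v1", "alias": "logs"}},
        {"add":    {"index": "logs-v2", "alias": "logs"}},
    ]
}
print(json.dumps(actions))
```

Because the remove and add happen in one request, searches against the "logs" alias never see a window where the alias points at nothing.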
You can also consider using larger EC2 instances to scale up rather than
additional instances. For example, you can set 5 shards and 1 replica and
use 5 small EC2 instances as the low point, then switch to bigger instances as
needed by adding bigger ones and removing the smaller ones gradually.
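For reference, the settings body for creating such an index might look like this (a sketch of the create-index request body; the host and index name are placeholders):

```python
import json

# Settings for an index that tops out at 10 nodes:
# 5 primary shards x 2 copies (primary + 1 replica) each.
settings = {
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
        }
    }
}

# Sent as the body of: PUT http://localhost:9200/myindex
print(json.dumps(settings))
```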
Mostly I think you'll need to profile the app somewhat first
before setting up such an automated system to scale up and down. My guess is
that, given the way ES works, it would be computationally expensive to change
the number of shards dynamically (essentially it would likely require
re-indexing the whole data set), hence you may need to think of other ways of
scaling up and down automatically.
Hope this helps..
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype
On Sat, Jul 24, 2010 at 11:05 AM, James Cook <jcook@tracermedia.com> wrote:
My apologies for bumping this, but it is important to me.
On Sat, Jul 17, 2010 at 10:16 PM, James Cook <jcook@tracermedia.com> wrote:
I am preparing to roll out a web application for QA testing on EC2.
This web application contains an embedded ES server. Scalr is used to spin
up new instances when load increases, and tears down instances when the load
decreases. I have a couple questions regarding ES and its operation in this
environment.
- How many shards and replicas should I choose for this
environment? I can restrict the minimum and maximum number of EC2 instances,
but I would prefer to leave the maximum open ended. I plan to set the
minimum number of instances to two or three, mostly to ensure that most of
our data is cached in memory by Hazelcast, which is acting as an IMDB in front
of Elastic Search.
- Is the shard count fixed? Can it be set dynamically based on the
number of instances I have running? If not, is it detrimental to specify a
very high number of shards (like 30 or more, to cover cases where 10-15
instances of the web app are running), and what happens if I have that
number of shards and only two instances?