We have an application contains over 50,000 user-generated tables (each with
user-defined fields and data, etc…) in a relational db, and new tables are
created on the fly. Some of the tables have large data sets and heavy query
loads, whereas other tables have minimal data/minimal query load. Basically,
we do not have the ability to know in advance the ideal # of ES ‘shards’ and
‘replicas’ to assign to each table.
We are considering to use ES to provide full-text/faceted search for these
tables and had the following questions:
-
According to
http://elasticsearch-users.115913.n3.nabble.com/Using-ES-in-a-dynamic-EC2-environment-tp975766p994333.htmlfrom July 2010 and
http://elasticsearch-users.115913.n3.nabble.com/Choosing-Shards-and-Replica-s-configuration-values-tp1593807p1595071.html
from September 2010 - for each ES index, the # of shards is fixed while the
of replicas can change. However, according to this post from June 2011 (
http://www.stumbleupon.com/devblog/searching-for-serendipity/), it says ‘Changing
the replica or shard count in response to load is easily done at runtime’.
Does ES now support changing the shard count for an existing index on the
fly, or did the stumbleupon post have a typo?
-
Given that each table has its own usage pattern, should we create a
separate ES index for each table? This would cause over 50,000 ES indexes to
be created and potentially a lot more in the future. Would the large number
of indexes cause a problem in ES? Is there a max limit before performance
would decrease?
-
If there are say 50,000 ES indexes, each configured with an average
of 2 shards, would that mean each _all query would need to touch 100,000
shards? If so, would that create a significant latency problem? Suppose that
we host all 50,000 ES indexes on 20 servers (nodes), would that improve
performance versus say hosting the 50,000 indexes on 40 servers (due to less
network traffic)? Or, is the problem of having to touch 100,000 shards
simply too great of a problem? The key question seems to be whether the
querying performance is affected by the # of shards (the more shards the
slower the queries), or by the number of servers (nodes)?
-
If #3 is a problem, should we instead just create a single ES index
with say 10 shards and 2 replicas (so max 20 nodes for now). We would then
able to increase replicas later to increase querying performance. But in
order to increase indexing performance, we would need to performance a full
re-index of everything on a new ES index with more shards. Is that a better
solution? Of course, we could potentially have say 3 indexes (usage pattern
A, usage pattern B and usage pattern C) each with a different shard/replica
setting, but the concept is the same.
-
On a more general standpoint, when a search occurs in ES for _all,
does ES actually issue the query to EVERY shard in EVERY index separately,
or is there a more efficient mechanism?
Thanks!