Relational->ES index schema strategy (1 index per table or all tables in 1 index)

We have an application containing over 50,000 user-generated tables (each with
user-defined fields, data, etc…) in a relational db, and new tables are
created on the fly. Some of the tables have large data sets and heavy query
loads, whereas other tables have minimal data and minimal query load.
Basically, we have no way to know in advance the ideal # of ES ‘shards’ and
‘replicas’ to assign to each table.

We are considering using ES to provide full-text/faceted search for these
tables and had the following questions:

  1. According to
http://elasticsearch-users.115913.n3.nabble.com/Using-ES-in-a-dynamic-EC2-environment-tp975766p994333.html
from July 2010 and
http://elasticsearch-users.115913.n3.nabble.com/Choosing-Shards-and-Replica-s-configuration-values-tp1593807p1595071.html
from September 2010, for each ES index the # of shards is fixed while the #
of replicas can change. However, a post from June 2011
(http://www.stumbleupon.com/devblog/searching-for-serendipity/) says ‘Changing
the replica or shard count in response to load is easily done at runtime’.
Does ES now support changing the shard count for an existing index on the
fly, or is the StumbleUpon post mistaken?

  2. Given that each table has its own usage pattern, should we create a
separate ES index for each table? This would cause over 50,000 ES indexes to
be created and potentially a lot more in the future. Would the large number
of indexes cause a problem in ES? Is there a max limit before performance
would decrease?

  3. If there are, say, 50,000 ES indexes, each configured with an average
of 2 shards, would that mean each _all query would need to touch 100,000
shards? If so, would that create a significant latency problem? Suppose that
we host all 50,000 ES indexes on 20 servers (nodes); would that improve
performance versus hosting the 50,000 indexes on 40 servers (due to less
network traffic)? Or is the problem of having to touch 100,000 shards simply
too great? The key question seems to be whether querying performance is
affected by the # of shards (the more shards the slower the queries) or by
the number of servers (nodes).

  4. If #3 is a problem, should we instead just create a single ES index
with, say, 10 shards and 2 replicas (so max 20 nodes for now)? We would then
be able to increase replicas later to increase querying performance. But in
order to increase indexing performance, we would need to perform a full
re-index of everything into a new ES index with more shards. Is that a better
solution? Of course, we could potentially have, say, 3 indexes (usage pattern
A, usage pattern B and usage pattern C), each with a different shard/replica
setting, but the concept is the same.

  5. From a more general standpoint, when a search occurs in ES for _all,
does ES actually issue the query to EVERY shard in EVERY index separately,
or is there a more efficient mechanism?

Thanks!

On Fri, Sep 16, 2011 at 11:14 AM, datadev nji@adinfocenter.com wrote:

We have an application containing over 50,000 user-generated tables (each
with user-defined fields, data, etc…) in a relational db, and new tables
are created on the fly. Some of the tables have large data sets and heavy
query loads, whereas other tables have minimal data and minimal query load.
Basically, we have no way to know in advance the ideal # of ES ‘shards’ and
‘replicas’ to assign to each table.

We are considering using ES to provide full-text/faceted search for these
tables and had the following questions:

  1. According to
http://elasticsearch-users.115913.n3.nabble.com/Using-ES-in-a-dynamic-EC2-environment-tp975766p994333.html
from July 2010 and
http://elasticsearch-users.115913.n3.nabble.com/Choosing-Shards-and-Replica-s-configuration-values-tp1593807p1595071.html
from September 2010, for each ES index the # of shards is fixed while the #
of replicas can change. However, a post from June 2011
(http://www.stumbleupon.com/devblog/searching-for-serendipity/) says ‘Changing
the replica or shard count in response to load is easily done at runtime’.
Does ES now support changing the shard count for an existing index on the
fly, or is the StumbleUpon post mistaken?

The number of shards still cannot be changed. I believe the post is
referring to how StumbleUpon uses Elasticsearch, where indices keep being
created in a rolling fashion or some other fashion, and in this case, new
indices can have a different number of shards.
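To make the distinction concrete, here is a minimal sketch (Python with the
requests library against the REST API; the host and index name are made up)
of the one setting that can be changed on a live index, the replica count:

    import requests

    ES = "http://localhost:9200"  # hypothetical cluster address

    # The replica count can be raised or lowered on a live index at runtime.
    requests.put(ES + "/my_table_index/_settings",
                 json={"index": {"number_of_replicas": 2}})

    # There is no equivalent call for number_of_shards: it is fixed when the
    # index is created, so changing it means reindexing into a new index.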


  2. Given that each table has its own usage pattern, should we create a
separate ES index for each table? This would cause over 50,000 ES indexes to
be created and potentially a lot more in the future. Would the large number
of indexes cause a problem in ES? Is there a max limit before performance
would decrease?

The limit for a large number of indices is whether the cluster can handle it
in terms of resources. It boils down to how many shards one will have on a
single node. Each shard is a Lucene index, and it comes at a resource cost
(file descriptors, memory and so on). It is hard to give numbers as to how
many shards one can have on a single node, but 50k sounds like too much, and
you can probably use a different schema that will require fewer machines in
the cluster to sustain.


  3. If there are, say, 50,000 ES indexes, each configured with an average
of 2 shards, would that mean each _all query would need to touch 100,000
shards? If so, would that create a significant latency problem? Suppose that
we host all 50,000 ES indexes on 20 servers (nodes); would that improve
performance versus hosting the 50,000 indexes on 40 servers (due to less
network traffic)? Or is the problem of having to touch 100,000 shards simply
too great? The key question seems to be whether querying performance is
affected by the # of shards (the more shards the slower the queries) or by
the number of servers (nodes).

"Touching" 100k shards on a single request over 20 or 40 nodes will overload
the cluster. Thats too many shards and too little machines for a single
request.

When you query, the search requests per shard are sent in an async manner,
so there is no blocking aspect to each shard, but, with so little number of
machines compared to teh number of shards touched on each request, it will
end up overloading them.


  4. If #3 is a problem, should we instead just create a single ES index
with, say, 10 shards and 2 replicas (so max 20 nodes for now)? We would then
be able to increase replicas later to increase querying performance. But in
order to increase indexing performance, we would need to perform a full
re-index of everything into a new ES index with more shards. Is that a better
solution? Of course, we could potentially have, say, 3 indexes (usage pattern
A, usage pattern B and usage pattern C), each with a different shard/replica
setting, but the concept is the same.

Yes, that would be the better solution. Note, 10 shards with 2 replicas tops
out at 30 machines (the 2 is the additional replicas).
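As a rough sketch of that setup (Python + requests against the REST API; the
index name is made up), the shard count is chosen once at creation time, and
only the replica count can be raised afterwards, as shown earlier:

    import requests

    ES = "http://localhost:9200"  # hypothetical cluster address

    # Shard count is fixed at creation time, so it has to be chosen up front.
    requests.put(ES + "/tables", json={
        "settings": {"number_of_shards": 10, "number_of_replicas": 2}
    })

    # Query capacity can later be added by raising only the replica count
    # (via the _settings call shown earlier); more primaries need a re-index.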

Another option that you have is to use more shards, let's say 100, and use
custom routing values per user. This will end up having each user "stored"
in a single shard. When searching, if you provide that routing value, it
will also only hit a single shard (so you don't have the overhead of
searching over 100 shards).

That's assuming the more common use case is one where a single user does a
search, and a full cross-user search is less common (though still possible).
Also note that, when searching, you can provide several values for the
routing, so you can search over several users in one go.
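A minimal sketch of that pattern (Python + requests; index, field and user
names are all made up, and the request shapes follow the current REST API
rather than the 0.17-era one):

    import requests

    ES = "http://localhost:9200"  # hypothetical cluster address

    # Route each document to the shard derived from its owning user.
    requests.put(ES + "/tables/_doc/1", params={"routing": "user_42"},
                 json={"owner": "user_42", "title": "quarterly budget"})

    # A single-user search only touches the shard that routing value maps to.
    requests.post(ES + "/tables/_search", params={"routing": "user_42"},
                  json={"query": {"match": {"title": "budget"}}})

    # Several routing values can be passed to search a few users in one go.
    requests.post(ES + "/tables/_search", params={"routing": "user_42,user_97"},
                  json={"query": {"match": {"title": "budget"}}})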

The above is made really simple in 0.17 with the use of aliases and the
ability to have both a routing value and a filter associated with an alias.
You can have an alias called "user_xxx" with a routing value of xxx
associated with it, and a filter (to return only that user's data). When you
search against "user_xxx", you get the optimized single-shard search and the
automatic filtering to return only that user's data.
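A sketch of that alias setup (Python + requests; the alias, index and field
names are made up, and the call shape follows the current aliases API, which
may differ slightly from the 0.17 syntax):

    import requests

    ES = "http://localhost:9200"  # hypothetical cluster address

    # One alias per user: a routing value for the single-shard optimization,
    # plus a filter so only that user's documents come back.
    requests.post(ES + "/_aliases", json={
        "actions": [{
            "add": {
                "index": "tables",
                "alias": "user_42",
                "routing": "user_42",
                "filter": {"term": {"owner": "user_42"}}
            }
        }]
    })

    # Searching the alias behaves like searching the index, but the request
    # is routed to one shard and automatically filtered to user_42's data.
    requests.post(ES + "/user_42/_search",
                  json={"query": {"match_all": {}}})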


  5. From a more general standpoint, when a search occurs in ES for _all,
does ES actually issue the query to EVERY shard in EVERY index separately,
or is there a more efficient mechanism?

It issues a search to every shard separately. We could potentially do a
serial search on each node across the shards it holds, but that would cost
in terms of latency.

Thanks!

Shay,

I have a slightly different problem. We are planning a multi-tenant
deployment in which the number of tenants (accounts) is going to be
around 1,000, and I want to achieve 2 goals:

  1. Restrict each account to accessing only its own data
  2. Enable the admins to search _all

There are 2 approaches I can think of:

  1. Put data from all accounts into one index, but implement the
    restriction/control by inserting a query condition to limit the search
    to its own data, i.e. the data "tagged" with its account id. In this
    case, the challenge is to come up with a generic query condition for
    this purpose, which I think is difficult given the diverse nature and
    complexity of the query DSL.
  2. Put each document into 2 indices: one the combined master, the other
    the account-specific index. For this approach, I am concerned about
    handling too many indexes/shards; even if each query only involves a
    small number (say, 5) of shards, the total number of shards will be in
    the thousands. Is my concern valid?

Do you have any suggestions?

Thanks in advance.

-- Liyu


Looks like I can always wrap the original query into a "filtered" clause
and add an "account" filter. But it would be nice to know the answer to
question 2.
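For what it's worth, a sketch of that wrapping (Python + requests; index and
field names are made up; the "filtered" query is the construct of that era,
while recent versions express the same thing as a bool query with a filter
clause):

    import requests

    ES = "http://localhost:9200"  # hypothetical cluster address

    def restrict_to_account(query, account_id):
        # Wrap any tenant-supplied query so it can only match documents
        # tagged with that tenant's account id.
        return {
            "query": {
                "filtered": {
                    "query": query,
                    "filter": {"term": {"account_id": account_id}}
                }
            }
        }

    # The tenant's original query, restricted to account 17.
    body = restrict_to_account({"query_string": {"query": "annual report"}}, 17)
    requests.post(ES + "/accounts/_search", json=body)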

Thanks,

-- Liyu


On Sat, Oct 1, 2011 at 4:07 AM, Liyu liyuyi@gmail.com wrote:

Shay,

I have a slightly different problem. We are planning a multi-tenant
deployment in which the number of tenants (accounts) is going to be
around 1,000, and I want to achieve 2 goals:

  1. Restrict each account to accessing only its own data
  2. Enable the admins to search _all

There are 2 approaches I can think of:

  1. Put data from all accounts into one index, but implement the
    restriction/control by inserting a query condition to limit the search
    to its own data, i.e. the data "tagged" with its account id. In this
    case, the challenge is to come up with a generic query condition for
    this purpose, which I think is difficult given the diverse nature and
    complexity of the query DSL.
  2. Put each document into 2 indices: one the combined master, the other
    the account-specific index.

You don't need a "master" index, use a "master" alias.

For this approach, I am concerned about handling too many indexes/shards;
even if each query only involves a small number (say, 5) of shards, the
total number of shards will be in the thousands. Is my concern valid?
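A sketch of the "master" alias suggestion above (Python + requests; the
per-account index names are made up): instead of double-writing into a
combined master index, one alias can point at every account index, and the
admins search the alias:

    import requests

    ES = "http://localhost:9200"  # hypothetical cluster address

    # Point a single "master" alias at every per-account index.
    account_indices = ["account_1", "account_2", "account_3"]  # hypothetical
    requests.post(ES + "/_aliases", json={
        "actions": [{"add": {"index": idx, "alias": "master"}}
                    for idx in account_indices]
    })

    # An admin search across all accounts goes through the alias; each
    # document still lives only in its own account index.
    requests.post(ES + "/master/_search",
                  json={"query": {"match_all": {}}})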

