Multi-tenancy and multiple routing values

So I'm noodling around and trying to determine the best way to set up a
multi-user ES cluster such that each user is an administrator of a portion
of the ES cluster. Each administrator could theoretically have their own
multi-tenant setup, where their index is routed by the users in their
index, etc.

My first thought was to create an index for each administrator, which gives
them the full range of routing configuration. However, this doesn't
scale past a few thousand indices because of the Lucene overhead (unless I
want to keep throwing hardware at the problem). I could spin up entirely
new clusters when the index ceiling is reached, although managing multiple
clusters may get tricky.

I could implement a "User dataflow" system with routing, such as described
by Shay in his Berlin Buzzwords presentation. Each document indexed to a
particular administrator is assigned an admin_id property, and all
subsequent searches are filtered/aliased/routed on that field. However,
this means that the administrator is unable to use routing themselves (say,
if they need to implement "user dataflow" themselves).
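
For reference, the filtered/aliased/routed setup I mean would look roughly
like the sketch below. It only builds the request body for the _aliases
endpoint and prints it. The index name "shared" and the alias naming scheme
"admin_<id>" are made up for illustration.

```python
import json


def admin_alias_action(index, admin_id):
    """Build one 'add' action for the _aliases endpoint.

    The alias filters on admin_id and also routes on it, so searches
    through the alias only touch the shard that holds that admin's data.
    """
    return {
        "add": {
            "index": index,
            "alias": "admin_%d" % admin_id,  # hypothetical naming scheme
            "filter": {"term": {"admin_id": admin_id}},
            "routing": str(admin_id),
        }
    }


# "shared" is a placeholder index name.
body = {"actions": [admin_alias_action("shared", 5)]}
print(json.dumps(body, indent=2))
```

The catch is visible right there: the alias pins the routing value to the
admin, so there is nothing left for the admin to route on.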

Is there a way to set up routed-routing? Something like:

  • Admin_id = 5 routes to shards 1,2,3,4
    • User_id = 100 routes to shard 3, User_id = 200 routes to shard 1,
      etc
  • Admin_id = 6 routes to shards 5,6,7,8
    • Widget_id = 10 routes to shard 5, Widget_id = 20 routes to shard 7,
      etc
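
To make the idea concrete, here is a toy sketch of the two-level lookup in
plain Python. This is not an ES feature or API, just the behaviour I am
asking about. The shard groups come from the example above, the CRC32 hash
is an arbitrary choice, and the shard it picks for a given key will not
necessarily match the numbers in the example.

```python
import zlib

# Hypothetical shard groups, one per admin (from the example above).
ADMIN_SHARDS = {
    5: [1, 2, 3, 4],
    6: [5, 6, 7, 8],
}


def route(admin_id, inner_key):
    """Pick a shard in two steps.

    Step 1: the admin id selects a fixed group of shards.
    Step 2: a hash of the admin's own routing key (user id, widget id, ...)
            selects one shard inside that group.
    """
    shards = ADMIN_SHARDS[admin_id]
    digest = zlib.crc32(str(inner_key).encode("utf-8"))
    return shards[digest % len(shards)]


# The same key always lands on the same shard, inside the admin's group.
print(route(5, 100), route(5, 200), route(6, 10), route(6, 20))
```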

Alternatively, I could simply disable routing and say "tough luck", but it
is such a powerful performance feature that I would hate to disable it. At
this point, I think the best solution is to spin up entirely new ES clusters
when the index ceiling is reached.

Thanks!
-Zach

--

If the admins don't need to share data with each other and you have enough
nodes, why not just deploy a cluster per admin?

As I understand Shay's Berlin Buzzwords talk, the user dataflow is for user
actions (search & index), not for ES admin actions with a higher impact
on the cluster settings.

Jörg

--

Sharing data is not necessary...I had in fact planned on creating an
authorization system with a reverse proxy to segment admins to their own
data.
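
The check inside the proxy could be as simple as the sketch below. It
assumes (my assumption, nothing is built yet) that every index belonging to
an admin is named with an "admin<id>_" prefix, and it only looks at the
first path segment of the request.

```python
def allowed_prefix(admin_id):
    # Hypothetical naming convention: admin 5 owns "admin5_*" indices.
    return "admin%d_" % admin_id


def is_allowed(admin_id, path):
    """Return True only if every index named in the path belongs to the admin."""
    segments = [s for s in path.split("?", 1)[0].split("/") if s]
    if not segments:
        return False  # bare "/" is not an index request
    first = segments[0]
    if first.startswith("_"):
        return False  # cluster-level endpoints such as /_cluster or /_all
    prefix = allowed_prefix(admin_id)
    # The first segment may list several indices separated by commas.
    return all(name.startswith(prefix) for name in first.split(","))


print(is_allowed(5, "/admin5_logs/_search"))  # own index
print(is_allowed(5, "/admin6_logs/_search"))  # someone else's index
```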

How much overhead does an ES instance have beyond the heap memory
that is assigned to it? My concern with a cluster per admin is that I do
not know ahead of time the requirements of a particular admin, and
therefore resources will be over- or under-provisioned.

One admin may only need a single index with 10k documents, while another
may require several indices and 100m documents. I may also have a large
number of admins (up to a hundred). I'm not sure how I would provision
these clusters with appropriate memory a priori...I imagine full cluster
restarts would be necessary if an admin outgrows their memory allocation.

I assumed that it would be more efficient (in terms of hardware) to segment
everyone into a single cluster, rather than try to balance individual
cluster requirements.

I also realize this is a use-case that ES was not exactly designed for, so
kludgy solutions are going to happen regardless =)

Thanks for the help!
-Zach

--

After thinking some more, the "cluster per admin" solution may be best,
particularly because of OOM situations. If an admin accidentally causes an
OOM due to a large facet, it would take down the entire node (including the
data belonging to other admins). With individual clusters, an OOM only
shuts down the shards associated with that particular cluster and not
everyone else's.

I'll pay a small memory penalty for loading up a JVM for each ES instance
(somewhere in the 100-200 MB range, best I can tell), but this is probably a
small thing compared to knocking out service for other users.
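
Back-of-the-envelope for the worst case, assuming one ES instance (one JVM)
per admin, the "up to a hundred" admins mentioned earlier, and my rough
100-200 MB estimate (not a measured number):

```python
ADMINS = 100                  # upper bound mentioned earlier in the thread
JVM_OVERHEAD_MB = (100, 200)  # rough per-instance estimate, not measured

# Total fixed overhead across all instances, low and high end.
low_mb, high_mb = (ADMINS * mb for mb in JVM_OVERHEAD_MB)
print("%d-%d MB" % (low_mb, high_mb))  # prints "10000-20000 MB"
```

So roughly 10 to 20 GB of fixed overhead spread across the fleet, before
any heap is assigned.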

-Zach

On Sunday, January 13, 2013 3:00:49 PM UTC-5, Zachary Tong wrote:

--