Advice on "sharded" client setup

Heh, confusing subject, sorry.

Our application shards on client: we have a separate Postgres database per
client. So I naturally gravitated towards creating a separate
Elasticsearch index per client. After perusing this group some, I realize
that was a mistake: I now have a single node "cluster" that has over 1000
shards.

I've read some messages suggesting the way to go in this situation is this:

  1. Create a single index with 20-30 shards (or however large you want
    your cluster to be able to grow to).
  2. Create an alias per client with filter on, say, field client_id.
  3. Optionally specify routing on the alias.
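
For reference, steps 2 and 3 might look roughly like the sketch below. It only builds the JSON body for a POST to the _aliases endpoint and prints it; nothing is sent to a cluster. The index name "clients", the lowercased alias names, and the helper function are assumptions made up for illustration.

```python
import json

def client_alias_action(index, client_id, use_routing=False):
    """Build one 'add' action: a filtered alias for a single client."""
    add = {
        "index": index,                       # hypothetical communal index name
        "alias": client_id.lower(),           # hypothetical alias naming scheme
        "filter": {"term": {"client_id": client_id}},
    }
    if use_routing:
        # Optional step 3: route this client's documents by client_id.
        add["routing"] = client_id
    return {"add": add}

body = {
    "actions": [
        client_alias_action("clients", client, use_routing=True)
        for client in ("Client1", "Client2")
    ]
}

# This is the body that would be POSTed to /_aliases.
print(json.dumps(body, indent=2))
```

Leaving use_routing at False gives the variant without step 3.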

So I have a few questions about this setup.

The "primary key" in Elasticsearch is _id and _type, correct? So I'm going
to have to change my code to set _id to "client_id:id"? Or will ES allow
for the following two documents:

_id: 123
_type: "Type1"
client_id: "Client1"

_id: 123
_type: "Type1"
client_id: "Client2"

We're leaning towards not specifying the routing in the alias because we're
afraid of creating hotspots; we just want each "client" evenly distributed
across all shards, and will rely on adding nodes and increasing replication
to handle scaling of reads. Does that sound reasonable?

Now for the crummy part. Each of our clients' documents will have
different fields. For example, we have a document type
"Application::Profile". For Client1, the fields might be [a, b, c], but
for Client2 the fields will be [d, e, f]. So I see two ways to solve this
problem:

  1. Define type "Application::Profile" to have fields that are a superset
    of all the fields of all the clients.
  2. Define different types for each client:
    "Application::Profile/Client1", "Application::Profile/Client2"

Any suggestions? I don't really like either one of those solutions and am
considering just continuing with the idea of 1 index per client, but
reducing the number of shards per index down to 1, then just adding nodes.
This still has issues though, like hotspots.

Thanks for the help.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Chris

> The "primary key" in Elasticsearch is _id and _type, correct? So I'm
> going to have to change my code to set _id to "client_id:id"? Or will
> ES allow for the following two documents:
>
> _id: 123
> _type: "Type1"
> client_id: "Client1"
>
> _id: 123
> _type: "Type1"
> client_id: "Client2"

The unique identifier is index/type/id (and that is "index", not
"alias") so your two documents above will be considered to be the same
document and one will overwrite the other. You will need to integrate
the client_id into the id itself.
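
A minimal sketch of the difference, using a plain Python dict keyed by (index, type, id) as a stand-in for the index. The helper names and the "clients" index name are hypothetical, and the "client_id:id" format is the one proposed in the question.

```python
def make_doc_id(client_id, local_id):
    """Combine the client and the per-client id into one unique id."""
    return f"{client_id}:{local_id}"

def index_doc(store, index, doc_type, doc_id, source):
    """Stand-in for indexing: same (index, type, id) means overwrite."""
    store[(index, doc_type, doc_id)] = source

# Plain ids: the second document replaces the first.
plain = {}
index_doc(plain, "clients", "Type1", "123", {"client_id": "Client1"})
index_doc(plain, "clients", "Type1", "123", {"client_id": "Client2"})

# Composite ids: both documents survive.
composite = {}
for client in ("Client1", "Client2"):
    index_doc(composite, "clients", "Type1",
              make_doc_id(client, 123), {"client_id": client})

print(len(plain), len(composite))  # prints: 1 2
```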

> We're leaning towards not specifying the routing in the alias because
> we're afraid of creating hotspots; we just want each "client" evenly
> distributed across all shards, and will rely on adding nodes and
> increasing replication to handle scaling of reads. Does that sound
> reasonable?

You can do that, but I wouldn't worry about creating hotspots. If you
find that you have one client which is much bigger than the others, then
just create a separate index for that client. Then update the client's
alias to point to the new index instead of the communal index. Problem
solved.
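
The alias update could be sketched as a single _aliases request with a remove and an add action, which are applied together. This only builds and prints the request body. The index and alias names are hypothetical, and it assumes the client's documents have already been copied into the new index, since moving an alias does not move any data.

```python
import json

def move_client_actions(alias, old_index, new_index):
    """Repoint one client's alias from the communal index to its own index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            # No filter is needed here if the new index holds only this client.
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Hypothetical names: communal index "clients", dedicated index "client1_only".
body = move_client_actions("client1", "clients", "client1_only")
print(json.dumps(body, indent=2))
```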

The beauty of having one routing value per client in the communal index
is that your queries for that client only need to hit one shard.
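
To see why, here is a toy model of shard selection: the shard is a hash of the routing value modulo the number of primary shards. zlib.crc32 is only a stand-in (Elasticsearch uses its own hash function internally), and the shard count of 20 is an assumption.

```python
import zlib

NUM_SHARDS = 20  # hypothetical shard count for the communal index

def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Toy shard selection: hash(routing) % number_of_primary_shards."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards

doc_ids = [f"Client1:{n}" for n in range(1000)]

# Default: the document id acts as the routing value, so documents
# from one client are spread over many shards.
spread = {shard_for(doc_id) for doc_id in doc_ids}

# Per-client routing: every document uses the same routing value,
# so they all land on the same shard.
pinned = {shard_for("Client1") for _ in doc_ids}

print("shards touched without custom routing:", len(spread))
print("shards touched with routing=Client1:", len(pinned))
```

A search through an alias that carries that routing value only has to visit the one shard in pinned.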

> Now for the crummy part. Each of our clients' documents will have
> different fields. For example, we have a document type
> "Application::Profile". For Client1, the fields might be [a, b, c],
> but for Client2 the fields will be [d, e, f]. So I see two ways to
> solve this problem:
>
>  1. Define type "Application::Profile" to have fields that are a
>     superset of all the fields of all the clients.
>  2. Define different types for each client:
>     "Application::Profile/Client1",
>     "Application::Profile/Client2"
>
> Any suggestions? I don't really like either one of those solutions
> and am considering just continuing with the idea of 1 index per
> client, but reducing the number of shards per index down to 1, then
> just adding nodes. This still has issues though, like hotspots.

Fields don't need to have values. So you can have a superset of all
fields, and in your documents just use the fields you need for each
client. The only time there is a conflict is when client_1 wants the
'name' field to be analyzed in one way, and client_2 wants the 'name'
field to be analyzed in another. The solution to this is to use two
different field names.
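
One way to build such a superset is sketched below: merge each client's field definitions into one properties block, and give a field a client-specific name only when its definition conflicts with one already present. The helper, the suffix naming scheme, and the example definitions are assumptions; the field type names in particular depend on your Elasticsearch version.

```python
def merge_client_fields(per_client):
    """Merge {client: {field: definition}} into one superset of properties.

    The first client to define a field keeps the plain name; a later
    client whose definition differs gets '<field>_<client>' instead.
    Returns (properties, renames), where renames maps (client, field)
    to the new field name that client must use in its documents.
    """
    properties = {}
    renames = {}
    for client, fields in per_client.items():
        for name, definition in fields.items():
            if name not in properties:
                properties[name] = definition
            elif properties[name] != definition:
                new_name = f"{name}_{client.lower()}"
                properties[new_name] = definition
                renames[(client, name)] = new_name
    return properties, renames

# Hypothetical per-client definitions; "string" is a placeholder type name.
per_client = {
    "Client1": {"a": {"type": "string"},
                "name": {"type": "string", "analyzer": "standard"}},
    "Client2": {"d": {"type": "string"},
                "name": {"type": "string", "analyzer": "keyword"}},
}

properties, renames = merge_client_fields(per_client)
print(sorted(properties))  # ['a', 'd', 'name', 'name_client2']
print(renames)             # {('Client2', 'name'): 'name_client2'}
```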

hth

clint
