Users data flow

Hi,

I am trying to figure out the best way to design my ES cluster. Currently
my search service is subscription based and each user can only search his
own data.

So looking around I found several examples about users data flow and the
way of using aliases, and it's all straightforward.

One thing that I am struggling to understand is the routing setup. Now let's
assume that I started an index named "accounts" with 100 primary shards and
1 replica. Now users start subscribing, so I start creating an alias per
user and routing each alias to a specific shard (1, 2, 3, ..., 100).

Now if 100 users have already subscribed and a new user comes along, can I
route the new user to the first shard? Or should I start another index for
the next 100 users?

My other concern with this is performance. Let's say that both nodes are
running on a quad-core CPU with 32 GB of RAM. Is there a good indicator of
how many shards I should allocate per index, assuming that each document is
around 512KB in size?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8338a395-6143-4899-bc9e-6145399cc4a4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

You don’t need one shard per user unless each user has a very large amount of data.
Using routing is good, as all documents for a given user will go to the same shard. But documents from other users will also go to that shard.

That’s not an issue. Use filters to filter your user data based on their login/id/whatever.
The cool thing you could do then is to use aliases. Create one alias per user and set within this alias the routing key AND the filter.

Then just query the alias instead of the index name and you’re done.
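
As a rough sketch of what such an alias could look like, here is a small Python helper that builds the request body for the aliases API (POST /_aliases), with both a routing value and a filter. The alias naming scheme (user_<id>) and the user_id field are assumptions made for illustration, and the term filter assumes that field is indexed as an exact (not analyzed) value.

```python
import json


def build_user_alias_action(index, user_id, user_field="user_id"):
    """Build a body for POST /_aliases that adds one per-user alias.

    The alias name and the user_field default are hypothetical choices.
    """
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": f"user_{user_id}",
                    # Routing: requests through the alias use this routing
                    # value, so they only touch the shard it maps to.
                    "routing": str(user_id),
                    # Filter: hides other users' documents that happen
                    # to live on the same shard.
                    "filter": {"term": {user_field: user_id}},
                }
            }
        ]
    }


body = build_user_alias_action("accounts", "42")
print(json.dumps(body, indent=2))
```

Sending this body to the cluster and then searching user_42 instead of accounts should apply both the routing and the filter on every query.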

If a new user comes, just add the new alias and index their data. It will go to whichever shard the routing value maps to. You don’t really need to worry about it.

Then another question might be « how many shards will I need? » and the answer is "it depends", but I would say: try to keep the number as low as possible.

Make sense?

--
David Pilato - Developer | Evangelist

@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr | @scrutmydocs https://twitter.com/scrutmydocs

(In reply to Zaid Amir redserpent7@gmail.com, 23 Apr 2015, 09:30.)

So then what is the benefit of using aliases as opposed to using one index
and filtered queries? From what I've read, aliases and routing can give a
boost in queries since the index knows on which shards the documents are
located, but now you are saying that it does not matter since users' data
can be assigned to any shard. Is that correct?

Aliases help to avoid developer bugs!
Basically, imagine you forgot to apply the filter in one of your queries… Your user will see everything.
Also, aliases might help you to secure access to your users' data. If you are using Nginx or Shield, you can say that user A only has access to localhost:9200/a, which is an alias.

I meant that a user's data can be assigned to any shard, but all documents for that user will go to the same shard, whichever it is.

Let's say you have 2 shards in an index named docs.

All docs for user A will go to shard 0
All docs for user B will go to shard 1
All docs for user C will go to shard 0
All docs for user D will go to shard 1

If you query index docs, you have access to all docs.
If you query index docs with routing A (or with routing C), you will have access to docs for users A AND C, since both live on shard 0.
If you query index docs with routing A (or C) and a filter on A, you will have access to docs for user A only.

If you define an alias A with routing A and a filter on A, then if you query alias A you will have access to docs for user A only.
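
To make the cases above concrete, here is a toy in-memory simulation in Python. It is only a sketch: the hash (CRC32) is a stand-in and not the function Elasticsearch uses, so which users end up sharing a shard here will not match a real cluster, and the search and search_alias helpers are hypothetical names, not Elasticsearch APIs.

```python
import zlib

NUM_SHARDS = 2


def shard_for(routing, num_shards=NUM_SHARDS):
    # Stand-in hash; a real cluster uses its own hash function, so actual
    # shard numbers will differ. The principle is hash(routing) % shards.
    return zlib.crc32(routing.encode("utf-8")) % num_shards


# Three documents for each of the users A, B, C and D.
docs = [{"user": u, "name": f"{u}-file{i}.txt"} for u in "ABCD" for i in range(3)]

# Index every document using its user as the routing value.
shards = {n: [] for n in range(NUM_SHARDS)}
for doc in docs:
    shards[shard_for(doc["user"])].append(doc)


def search(routing=None, user_filter=None):
    """Return matching docs; routing narrows the shards, the filter narrows docs."""
    if routing is None:
        candidates = [d for shard in shards.values() for d in shard]
    else:
        candidates = list(shards[shard_for(routing)])
    if user_filter is not None:
        candidates = [d for d in candidates if d["user"] == user_filter]
    return candidates


def search_alias(user):
    """An alias with routing AND filter: both are always applied together."""
    return search(routing=user, user_filter=user)


print(len(search()))                          # every document in the index
print({d["user"] for d in search(routing="A")})  # A plus whoever shares A's shard
print({d["user"] for d in search_alias("A")})    # user A only
```

Routing alone only decides which shard is searched; it is the filter that keeps other users' documents out of the results.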

You can read the Definitive guide and in particular this section:

Designing for Scale | Elasticsearch: The Definitive Guide [2.x] | Elastic http://www.elastic.co/guide/en/elasticsearch/guide/current/scale.html
Faking Index per User with Aliases | Elasticsearch: The Definitive Guide [2.x] | Elastic http://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html

--
David Pilato - Developer | Evangelist

@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr | @scrutmydocs https://twitter.com/scrutmydocs

(In reply to Zaid Amir redserpent7@gmail.com, 23 Apr 2015, 10:05.)

Thanks for the explanation, it is clear now.

Now for the other part of my question. Let's assume that I am expecting this
index to hold data for 1000 users. Each user will have 500,000 documents
and each document will be 512KB. These documents are pure text files.
Let's say that my query will only search the field that holds the file
contents and will only return the file names.

Let's assume the cluster contains two nodes, each node has a quad-core CPU
and 16GB of RAM, and the heap size is set to 8GB on each node.

So with that example, how many shards would you say I need to get a
relatively fast search?

I know it's hard to calculate, but I would love to find a way to at least
estimate how many shards I would need, since this cannot be increased in the
future.

Hi,

If I have calculated correctly, that corresponds to about 238TB of raw
data. If this is the size of JSON documents being indexed in Elasticsearch,
you will definitely need more than 2 nodes.
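
For reference, the arithmetic behind that figure can be checked with a few lines of Python. This is just a sketch that reads "512KB" as 512 KiB and "TB" as TiB (powers of 1024); the function name is made up for the example.

```python
def raw_size_tib(users, docs_per_user, doc_size_kib):
    """Total raw data size in TiB, before any indexing overhead or replicas."""
    total_kib = users * docs_per_user * doc_size_kib
    return total_kib / 1024**3  # KiB -> MiB -> GiB -> TiB


all_users = raw_size_tib(1000, 500_000, 512)
first_hundred = raw_size_tib(100, 500_000, 512)

print(f"1000 users: {all_users:.1f} TiB")     # 238.4 TiB
print(f"100 users:  {first_hundred:.1f} TiB")  # 23.8 TiB
```

Note that this is raw document size only; replicas multiply it, and the size of the index on disk depends on the mapping.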

The good thing about using aliases the way David describes is that you do
not need to put all users in the same index, as the aliases hide the
underlying index and make that transparent. You can therefore, for example,
put your first 100 customers in one index and then add new indices as the
number of customers grows. This makes it easier to handle growth
incrementally.
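
One possible way to sketch that scheme in Python: map each subscriber to an index by the order in which they signed up, and point their alias at that index. The index naming (accounts_001, accounts_002, ...), the alias naming and the user_id field are all assumptions for illustration, not something the aliases API prescribes.

```python
def index_for_user(user_number, users_per_index=100, prefix="accounts"):
    """Return the index name for the n-th subscriber (1-based)."""
    if user_number < 1:
        raise ValueError("user_number is 1-based")
    group = (user_number - 1) // users_per_index + 1
    return f"{prefix}_{group:03d}"


def alias_action(user_number):
    """Build one 'add' action for POST /_aliases for this subscriber."""
    return {
        "add": {
            "index": index_for_user(user_number),
            "alias": f"user_{user_number}",
            "routing": str(user_number),
            "filter": {"term": {"user_id": user_number}},
        }
    }


print(index_for_user(1))    # accounts_001
print(index_for_user(100))  # accounts_001
print(index_for_user(101))  # accounts_002
```

Because the application only ever talks to the alias, it does not need to know which underlying index a given user landed in.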

Best regards,

Christian

(In reply to Zaid Amir's post of Thursday, April 23, 2015 at 8:30:52 AM UTC+1.)

Thanks for the reply. Yes, I know I exaggerated a bit there :) but to be
honest, finding a starting point for my cluster is driving me nuts. How
many shards to allocate per index is just not clear. I know each shard is a
Lucene instance and that having many instances running on a single node is
bad practice. But the question that remains is "How many is too many?"

So let's assume that my first index will only support 100 users. Given the
same amount of data per user, how many shards should I allocate for that
index? Is there a way to estimate or calculate that?

(In reply to christian...@elasticsearch.com, Thursday, April 23, 2015 at
2:22:23 PM UTC+3.)
