How to scale an ES deployment to millions of tenants with different data schemas

zivs · September 17, 2014, 1:05pm

Hi,
We are considering using ES as the primary data source for a new project.
Our data is generated by millions of different users, each with a
relatively small number of documents, but each with a different data
schema.

We are considering several approaches:

1. Index per user. We are concerned about scaling the ES cluster to support
   millions of indexes, each holding a relatively small number of docs.
2. All users colocated in a single index. We are concerned that an ES index
   will not support millions of different fields (as each user has a
   different data schema).
3. A mix of the two above: X users colocated in a single index, and
   Y such indexes to host our entire user population.
4. Some kind of "mapping layer" that maps each user's schema onto
   generic fields in one or more indexes. This would probably work, but
   is of course harder to implement and maintain.
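
To make the "mapping layer" option more concrete, here is a minimal sketch in Python. It assumes a fixed pool of generic, typed field names (`str_0`, `num_0`, ...) and an in-memory registry; those names and the `TenantFieldMapper` class are hypothetical and not part of ES.

```python
from collections import defaultdict

# Hypothetical generic slot prefixes, one per supported value type.
SLOT_PREFIX = {str: "str", int: "num", float: "num", bool: "bool"}


class TenantFieldMapper:
    """Maps each tenant's own field names onto a shared pool of generic fields."""

    def __init__(self):
        # tenant_id -> {tenant field name -> generic field name}
        self._slots = defaultdict(dict)
        # (tenant_id, prefix) -> next free slot number
        self._next = defaultdict(int)

    def _slot_for(self, tenant_id, field, value):
        mapping = self._slots[tenant_id]
        if field not in mapping:
            # Raises KeyError for unsupported value types (lists, dicts, ...).
            prefix = SLOT_PREFIX[type(value)]
            n = self._next[(tenant_id, prefix)]
            self._next[(tenant_id, prefix)] = n + 1
            mapping[field] = f"{prefix}_{n}"
        # A field keeps the slot it was first assigned, even if a later
        # document carries a value of a different type.
        return mapping[field]

    def to_generic(self, tenant_id, doc):
        """Translate a tenant document into the generic schema before indexing."""
        out = {"tenant_id": tenant_id}
        for field, value in doc.items():
            out[self._slot_for(tenant_id, field, value)] = value
        return out

    def from_generic(self, tenant_id, generic_doc):
        """Translate a stored document back into the tenant's own field names."""
        reverse = {g: f for f, g in self._slots[tenant_id].items()}
        return {reverse[k]: v for k, v in generic_doc.items() if k in reverse}


mapper = TenantFieldMapper()
print(mapper.to_generic("t1", {"color": "red", "price": 3}))
print(mapper.to_generic("t2", {"weight": 1.5, "name": "x"}))
```

With this scheme the index only ever sees as many fields as the largest tenant uses per type, at the cost of keeping the registry somewhere durable and translating queries as well as documents.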

So my questions:

1. Are there production deployments out there that have a million active
   indexes? What do they look like?
2. How many different fields does it make sense to host in a single index?
   Would it scale to millions of fields in a single index?
3. Are there other ways to go about this that we have overlooked?

thanks!!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f98651e6-5e6a-4ed8-aba8-b5e91078f036%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Itamar_Syn_Hershko · September 17, 2014, 1:21pm

First, you should really read this post regarding using ES as a single
source of truth:
http://aphyr.com/posts/317-call-me-maybe-elasticsearch

Millions of indexes are not advisable, unless you plan on having millions of
servers. Depending on index size and write frequency, you don't
want to have more than a few dozen indexes per machine (including
replicas), because every index costs memory, CPU, I/O and file
descriptors.
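
A back-of-the-envelope calculation shows why. The figures below (one million indexes, one replica each, 36 index copies per machine as a stand-in for "a few dozen") are illustrative assumptions, not measurements, and `machines_needed` is a hypothetical helper:

```python
import math


def machines_needed(num_indexes, replicas, copies_per_machine):
    """Machines required if each one holds at most `copies_per_machine` index copies."""
    total_copies = num_indexes * (1 + replicas)  # primaries plus replicas
    return math.ceil(total_copies / copies_per_machine)


# One index per tenant, one replica, roughly three dozen copies per machine.
print(machines_needed(1_000_000, 1, 36))  # 55556
```

Not literally millions of servers under these assumptions, but still tens of thousands of machines for what is a small amount of data per tenant.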

One big single index may present its own problems due to the different
schemas, although that may be solvable using dynamic templates:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/custom-dynamic-mapping.html#dynamic-templates
I would still expect you to have issues with the number of shards (basically,
running out of shards at some point).
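
As a rough illustration of dynamic templates, the sketch below builds an index body that maps new fields by name pattern. It assumes a hypothetical naming convention (`*_s` for strings, `*_n` for numbers), a hypothetical mapping type name `doc`, and the ES 1.x mapping syntax (`"type": "string"`, `"index": "not_analyzed"`); later ES versions use different type names, so treat it as a sketch rather than a ready-made mapping.

```python
import json


def build_index_body(type_name="doc"):
    """Index-creation body whose dynamic templates map new fields by name suffix."""
    return {
        "mappings": {
            type_name: {
                "dynamic_templates": [
                    {
                        # Any new field ending in _s is stored as an exact-value string.
                        "strings_not_analyzed": {
                            "match": "*_s",
                            "mapping": {"type": "string", "index": "not_analyzed"},
                        }
                    },
                    {
                        # Any new field ending in _n is stored as a double.
                        "numbers_as_double": {
                            "match": "*_n",
                            "mapping": {"type": "double"},
                        }
                    },
                ]
            }
        }
    }


print(json.dumps(build_index_body(), indent=2))
```

Note that a dynamic template only decides how a newly seen field is mapped. Each distinct field name still becomes its own entry in the index mapping.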

Therefore I would try to find a middle way here, probably using some sort
of mapping mechanism, possibly time-based as well if that's applicable.
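
One possible shape for such a middle way is to hash each tenant into a fixed number of shared indexes and optionally add a month suffix. This is only a sketch: the bucket count, the `tenants` prefix, and the `index_for` helper are all hypothetical choices.

```python
import hashlib
from datetime import date


def index_for(tenant_id, num_buckets=64, day=None, prefix="tenants"):
    """Pick the shared index for a tenant; append a YYYY.MM suffix if `day` is given."""
    # Use a stable hash: Python's built-in hash() of a str varies between processes.
    digest = hashlib.md5(tenant_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_buckets
    name = f"{prefix}-{bucket:03d}"
    if day is not None:
        name += f"-{day:%Y.%m}"
    return name


print(index_for("tenant-42"))
print(index_for("tenant-42", day=date(2014, 9, 17)))
```

The same tenant always lands in the same bucket, so queries can target one index and filter on a tenant ID field. Changing `num_buckets` later moves tenants between indexes, so it is worth choosing with growth in mind.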

Re your questions:

are there production deployments out there that have a million active
indexes? what do they look like?

I'm not aware of any.

how many different fields does it make sense to host in a single index?
would it scale to millions of fields in a single index?

You mean in a single document? I recall Shay suggesting not to go
over about 100 fields. Lucene really isn't optimized for scaling
vertically, especially at the document level.

are there other ways to go about this that we have overlooked?

Maybe look at your data model and try to re-arrange it.

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/


zivs · September 17, 2014, 1:35pm

Thanks for the prompt reply!
One thing though: with a single multi-tenant index, my concern is
not the number of fields per doc (which is small, fewer than 50),
but the fact that, since each tenant has different fields, the
accumulated number of fields in such an index will be huge.

That is, tenant 1 has fields F11..F1n, tenant 2 has fields F21..F2n, and so on.
These fields are distinct, so the number of fields in the multi-tenant
index will quickly grow into the millions.
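
To put a number on that growth, here is a quick count under assumed figures (one million tenants, 50 fields each); the numbers and the `distinct_fields` helper are illustrative only.

```python
def distinct_fields(num_tenants, fields_per_tenant, shared_fields=0):
    """Distinct field names in one shared index.

    `shared_fields` is how many of each tenant's fields use names common
    to all tenants; the rest are assumed unique to that tenant.
    """
    unique_per_tenant = fields_per_tenant - shared_fields
    return shared_fields + num_tenants * unique_per_tenant


print(distinct_fields(1_000_000, 50))      # 50000000
print(distinct_fields(1_000_000, 50, 25))  # 25000025
```

Even if half of every tenant's fields were shared, the total would still be in the tens of millions.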

Will such an indexing approach work in ES?

thanks!


Itamar_Syn_Hershko · September 17, 2014, 1:38pm

This will still mean less overhead than having those distinct fields in
discrete indexes. I wouldn't worry about that.

