Multiple indices vs. multiple shards approach


(Vladi Feigin) #1

Hello,

Please share your thoughts.
We have one big ES index with 18 shards (9 primaries and 9 replicas).
We have thousands of customers, and each customer may have millions of
documents or, at the other extreme, a very small number.
We never search across all customers, only within a specific customer. In
other words, all our queries have a customer ID filter.
The big disadvantage of having one big index is that every query searches the
data of all customers rather than just one customer's data.
Obviously this hurts our query performance.
We're thinking of creating multiple indices: one index per customer. But in
our case that means having hundreds or maybe thousands of indices,
which is a big maintenance overhead.
Another approach is to create many shards.
Could you please share your experience and thoughts?
What would you recommend in this scenario?
Thank you in advance,
Vladi Feigin

--
This message may contain confidential and/or privileged information.
If you are not the addressee or authorized to receive this on behalf of the
addressee you must not use, copy, disclose or take action based on this
message or any information herein.
If you have received this message in error, please advise the sender
immediately by reply email and delete this message. Thank you.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3105fe7f-ed68-4ff8-9db0-cac857a6622a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Matías Waisgold) #2

I had a project with the same context. We decided to increase the number of
shards, as it was impossible to have one index for each customer.
Another approach is to separate only some (hardcoded) customers from
the rest. If you can detect these users in advance, it might be a good idea
to do that and keep a "rest of the world" index for the less important ones.
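
A minimal sketch of the "hardcoded customers plus rest of the world" idea, assuming the application picks the index name before each request. The names DEDICATED_CUSTOMERS, SHARED_INDEX and index_for are hypothetical and only for illustration, as are the customer IDs.

```python
# Hypothetical configuration: customers known in advance to be large.
DEDICATED_CUSTOMERS = {"acme", "globex"}

# Hypothetical name of the shared "rest of the world" index.
SHARED_INDEX = "customers-rest"


def index_for(customer_id: str) -> str:
    """Return the index that should hold this customer's documents."""
    if customer_id in DEDICATED_CUSTOMERS:
        # Large customers get their own index.
        return f"customer-{customer_id}"
    # Everyone else shares a single index.
    return SHARED_INDEX


print(index_for("acme"))
print(index_for("small-shop"))
```

The same function would be used for both indexing and searching, so a customer's documents are always read from the index they were written to.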

Also, when we increased the number of shards, we increased the number of
servers but made them smaller, which improved our failure resiliency a lot.
Hope that helps.

On Friday, March 20, 2015 at 5:28:55 PM UTC+1, Vladi Feigin wrote:



(Mark Walkom) #3

This is where you use routing and aliases.

Use routing to send each customer's documents to a specific shard; you can
then query using the same routing value and reduce your exposure. Then use
aliases so you can easily move larger customers out to their own index if
need be.
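
As a rough illustration of the routing-plus-aliases suggestion, the sketch below builds a request body for the index aliases endpoint (`POST /_aliases`) that gives one customer an alias with a routing value and a term filter. The index name "customers", the field name "customer_id", the alias naming scheme and the helper customer_alias_action are assumptions for the example, not details from this thread.

```python
import json


def customer_alias_action(customer_id, index="customers", field="customer_id"):
    """Build one 'add' action for the _aliases API (names are illustrative)."""
    return {
        "add": {
            "index": index,
            "alias": f"customer-{customer_id}",
            # Routing value applied to indexing and searching through the alias.
            "routing": customer_id,
            # The filter keeps other customers that hash to the same shard
            # out of the results.
            "filter": {"term": {field: customer_id}},
        }
    }


body = {"actions": [customer_alias_action("42")]}
print(json.dumps(body, indent=2))
```

If the application only ever talks to the alias, moving a large customer to a dedicated index later should mostly come down to reindexing the data and repointing the alias, without changing application code.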



(cdahlqvist) #4

Hi,

You could get around this by using routing based on customer ID when
indexing and searching. This will ensure that all documents belonging to a
single customer will be located in the same shard, which means that each
search for a specific customer can hit a single shard instead of all 9,
which makes it scale better.
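
To illustrate why this works, here is a small simulation of hash-based routing, assuming the documented rule that the shard is chosen as hash(routing value) modulo the number of primary shards. The hash used here (CRC32) is only a stand-in and is not the function Elasticsearch uses, so the shard numbers will not match a real cluster; shard_for and the sample customer IDs are hypothetical.

```python
import zlib

NUM_PRIMARY_SHARDS = 9  # as in the original post


def shard_for(routing_value: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Pick a shard from a routing value (CRC32 is a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


# Hypothetical documents: (customer_id, document number).
docs = [("cust-a", i) for i in range(5)] + [("cust-b", i) for i in range(3)]

# Route every document by its customer ID and record which shards are used.
placement = {}
for customer_id, _doc_number in docs:
    placement.setdefault(customer_id, set()).add(shard_for(customer_id))

for customer_id, shards in sorted(placement.items()):
    print(customer_id, "->", sorted(shards))
```

Because the routing value is the same for every document of a customer, each customer ends up on exactly one shard, and a search that passes the same routing value only needs to visit that shard.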

Best regards,

Christian



(Vladi Feigin) #5

Thank you, Mark.
Can you please elaborate on the routing? Do you mean using the
customer ID as the routing value?
Can you give an example or a link?
Should I override the shard calculation function?


(Mark Walkom) #6

There is a whole bunch of good stuff in the docs so I'd suggest you start
there -
http://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-routing

Don't play with the hashing/sharding algorithm unless you know exactly what
you are doing!



(Vladi Feigin) #7

Thank you everybody for the help!
Is there a way to run routing in a debug mode? For example, to calculate the
shard ID via an API?
Thank you,
Vladi
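
One option that may help with the question above is the search shards API (`GET /<index>/_search_shards`), which accepts a `routing` parameter and reports which shards a search with that routing value would be sent to. The sketch below only builds the request URL; the base URL, the index name "customers" and the helper search_shards_url are assumptions for the example, and the exact response format may differ between versions.

```python
from urllib.parse import quote, urlencode


def search_shards_url(base_url: str, index: str, routing: str) -> str:
    """Build a _search_shards URL for a given routing value."""
    query = urlencode({"routing": routing})
    return f"{base_url.rstrip('/')}/{quote(index, safe='')}/_search_shards?{query}"


# Hypothetical local node and index name.
print(search_shards_url("http://localhost:9200", "customers", "42"))
```

Issuing a GET request to the resulting URL should show the shard copies that would serve a search routed with that customer ID.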



#8

It's been 2 years since this discussion was active :)
But I want to add a link to Shay Banon's lecture about this exact issue:
https://vimeo.com/44716955
It could be useful for new users who encounter the same issue.


(Aaron Mildenstein) #9

Just looking at a few of the slides there, and the fact that it's Shay presenting, I'm going to go out on a limb here and state that while many of the principles in here remain true, some of them are no longer true, and we're giving different advice. This is mostly due to the fact that Lucene has changed much since this talk, and many of the reasons we did things the way we did were to work around things that have been updated/fixed/changed (in Lucene), and no longer apply.

My recommendation: Get up-to-date recommendations from the Elasticsearch team before latching on to potentially outdated material.


#10

I'm sorry, I didn't mean to confuse anyone. I think it might be useful to state in the video description that the video is not relevant for later versions of Elasticsearch. In addition, it would be really great to update Elasticsearch: The Definitive Guide for such cases. Thanks for the advice!