Scaling strategies without shard splitting

Hi -

My team has used Solr in its single-node configuration (without SolrCloud)
for a few years now. In our current product we are now looking at
transitioning to SolrCloud, but before we make that leap I wanted to also
take a good look at whether Elasticsearch would be a better fit for our
needs. Although ES has some nice advantages (such as automatic shard
rebalancing), I'm trying to figure out how to live in a world without shard
splitting. In brief, our situation is as follows:

  • We use one index ("collection" in Solr) per customer.
  • The indexes are going to vary quite a bit in size, following something
    like a power-law distribution with many small indexes (let's guess < 250k
    documents), some medium-sized indexes (up to a few million documents), and a
    few large indexes (hundreds of millions of documents).
  • So the number of shards required per index will vary greatly, and will
    be hard to predict accurately at creation time.

How do people generally approach this kind of problem? Do you just make a
best guess at the appropriate number of shards for each new index and then
do a full re-index (with more shards) if the number of documents grows
bigger than expected?

Thanks!

  • Ian

I'm in a pretty similar boat and have done just fine without shard
splitting. I maintain the search index for about 900 wikis
(http://noc.wikimedia.org/conf/all.dblist). Each wiki gets two
Elasticsearch indexes and those indexes vary in size, update rate, and
query rate a ton. Most wikis get a single shard for all of their indexes
but many of them use more
(https://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/747fc7436226774d1735775c2ef41c911d59b5d2/wmf-config%2FInitialiseSettings.php#L13828).
I basically just guesstimated and reindexed the ones that were too big into
more shards.

We have a script that creates a new index with the new configuration, then
copies all the documents from the old index to the new one, and then swaps the
aliases (which we use for updates and queries) over to the new index. Then it
re-does any updates or deletes that occurred after the copy script started.
Having something like that is pretty common. I rarely use it to change
sharding configuration - it's much more common that I'll use it to change
how a field in the document is analyzed.
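
Roughly, the shape of such a script (just a sketch with made-up index and
alias names, written against the 2014-era 1.x Python client; not our actual
script):

    # Sketch of a reindex-and-swap script; index/alias names are examples.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan, bulk

    es = Elasticsearch()
    OLD, NEW, ALIAS = "customer_foo_v1", "customer_foo_v2", "customer_foo"

    # 1. Create the new index with the new shard count / analysis settings.
    es.indices.create(index=NEW, body={"settings": {"number_of_shards": 5}})

    # 2. Copy every document from the old index into the new one.
    def copy_docs():
        for hit in scan(es, index=OLD, query={"query": {"match_all": {}}}):
            yield {"_index": NEW, "_type": hit["_type"],
                   "_id": hit["_id"], "_source": hit["_source"]}
    bulk(es, copy_docs())

    # 3. Atomically repoint the alias used for updates and queries.
    es.indices.update_aliases(body={"actions": [
        {"remove": {"index": OLD, "alias": ALIAS}},
        {"add": {"index": NEW, "alias": ALIAS}},
    ]})

    # 4. Re-apply any updates or deletes that happened while the copy ran
    #    (application-specific, not shown).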

Elasticsearch also has another way to handle this problem (we don't use it,
for other reasons) where you create a single index for all customers and
then filter by customer at query time. You also add routing values to your
documents and queries so all documents from the same customer get routed to
the same shard. That way you can serve queries for a single customer out
of one shard which is pretty cool. For larger customers that don't fit on
a single shard you still create indexes just for them.
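
In case it helps, a rough sketch of what that looks like (the customer ids,
index, and field names are made up, and the query uses the 1.x-era "filtered"
syntax):

    # Sketch: one shared index, routed and filtered per customer.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Route every document by customer so one customer lives on one shard.
    es.index(index="customers", doc_type="event", id="1",
             routing="customer-42",
             body={"customer_id": "customer-42", "message": "hello world"})

    # Query with the same routing value (only that shard is searched) and a
    # term filter so other customers' documents never show up.
    es.search(index="customers", routing="customer-42", body={
        "query": {"filtered": {
            "query": {"match": {"message": "hello"}},
            "filter": {"term": {"customer_id": "customer-42"}},
        }}
    })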

One thing to watch out for, though, is that Elasticsearch doesn't use the
shard's size when determining where to place the shard. It'll check to
make sure the shard won't fill the disk beyond some percentage, but it won't
try to spread out the large shards, so you can get somewhat unbalanced disk
usage. I have an open pull request for something to do that, so this probably
won't be true forever, but it is true for now.
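
For reference, the percentage check I'm referring to is the disk allocation
decider; its thresholds are ordinary cluster settings (if I remember right the
defaults are around 85% and 90%). A sketch of tuning them:

    # Sketch: adjust the disk watermarks that gate shard allocation.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.cluster.put_settings(body={"transient": {
        # don't allocate new shards to a node above this disk usage
        "cluster.routing.allocation.disk.watermark.low": "85%",
        # start relocating shards away from a node above this disk usage
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }})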

How big are your documents and how frequently do you think you'll need
shard splitting? If your documents are pretty small you may be able to get
away with just reindexing all of them for the customer when you need more
shards like I do. It sure isn't optimal but it gets the job done.

Another way to do things is, once a customer gets too big, to create a
new index and route all of their new data there. You then have to query both
indexes. This is kind of how people handle log messages, and it might
work, depending on your use case.
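
A sketch of that pattern (the index naming scheme is made up, again against
the 1.x-era client): new writes go only to the newest per-customer index, and
searches cover all of them, e.g. with a wildcard or an alias spanning both:

    # Sketch: roll an oversized customer over to a second index.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Create the next index for this customer with a bigger shard count.
    es.indices.create(index="customer-42-000002",
                      body={"settings": {"number_of_shards": 5}})

    # New documents are indexed only into the newest index...
    es.index(index="customer-42-000002", doc_type="event",
             body={"message": "new data"})

    # ...while queries fan out over every index for that customer.
    es.search(index="customer-42-*", body={"query": {"match_all": {}}})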

Nik

Hey Nik -

Thanks for the response.

  • Ian

In my use case I have indexed a union catalog for some hundred libraries,
where each library can have its own search service and can also add its own
catalog data that it does not want to share.

Elasticsearch offers far more flexibility and performance than Solr, with
the ability to extend the cluster automatically by adding nodes (without any
configuration change), combined with automatic rebalancing of shards, plus
the features of index aliases and shard over-allocation; an explanation is
here:
http://elasticsearch-users.115913.n3.nabble.com/Over-allocation-of-shards-td3673978.html
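
Over-allocation, roughly, means giving the shared index more shards than there
are nodes today, so that simply starting more nodes later spreads the existing
shards out with no reindexing. A sketch (index name and numbers are only
examples, using the Python client):

    # Sketch: create the catalog index with more shards than current nodes.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.indices.create(index="catalog-13101400", body={"settings": {
        "number_of_shards": 12,      # room to grow onto up to 12 data nodes
        "number_of_replicas": 1,
    }})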

With index aliases, I do not have to perform evil things like shard
splitting. No index copy required, no full re-index.

That is, I can spread a library catalog index over the machines and
address an "index view" for each library by assigning several index aliases
(e.g. collection names or library identifiers), with term filters, to the
catalog segments they are interested in. Index updates come from a single
point: the primary database, plus data packages that the libraries can
upload. If the volume of input data exceeds the capacity, I can simply
start a new node, without touching the configuration.
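
Roughly, a per-library "index view" looks like this (index, alias, and field
names are only examples):

    # Sketch: one shared catalog index, one filtered alias per library.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # The alias restricts the library to its own collection via a term filter.
    es.indices.put_alias(index="catalog-13101400", name="library-foo",
                         body={"filter": {"term": {"collection": "foo"}}})

    # The library then searches its alias as if it were a private index.
    es.search(index="library-foo", body={"query": {"match": {"title": "faust"}}})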

Also, releasing new index versions is a snap with Elasticsearch. The index
names carry timestamp information (e.g. ddMMyyHH) and it is easy to
organize index versions like rolling windows, with the latest index being
the current one to search. Old indices are dropped when they are no longer
needed.
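
A sketch of such a rolling release (the timestamps and alias name are only
examples): build the new timestamped index, flip the search alias over, then
drop the old version.

    # Sketch: release a new index version behind a stable search alias.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    old_index, new_index = "catalog-06101400", "catalog-13101400"

    es.indices.update_aliases(body={"actions": [
        {"remove": {"index": old_index, "alias": "catalog-current"}},
        {"add": {"index": new_index, "alias": "catalog-current"}},
    ]})

    # Drop the old version once it is no longer needed.
    es.indices.delete(index=old_index)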

Jörg
