On Mon, Oct 13, 2014 at 11:12 AM, Ian Rose ianrose@fullstory.com wrote:
Hi -
My team has used Solr in its single-node configuration (without
SolrCloud) for a few years now. In our current product we are now looking
at transitioning to SolrCloud, but before we made that leap I wanted to
also take a good look at whether Elasticsearch would be a better fit for
our needs. Although ES has some nice advantages (such as automatic shard
rebalancing) I'm trying to figure out how to live in a world without shard
splitting. In brief, our situation is as follows:
- We use one index ("collection" in Solr) per customer.
- The indexes are going to vary quite a bit in size, following something
like a power-law distribution with many small indexes (let's guess < 250k
documents), some medium sized indexes (up to a few million documents) and a
few large indexes (hundreds of millions of documents).
- So the number of shards required per index will vary greatly, and will
be hard to predict accurately at creation time.
How do people generally approach this kind of problem? Do you just make a
best guess at the appropriate number of shards for each new index and then
do a full re-index (with more shards) if the number of documents grows
bigger than expected?
I'm in a pretty similar boat and have done just fine without shard
splitting. I maintain the search index for about 900 wikis
(http://noc.wikimedia.org/conf/all.dblist). Each wiki gets two
Elasticsearch indexes, and those indexes vary a ton in size, update rate, and
query rate. Most wikis get a single shard for all of their indexes,
but many of them use more
(https://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/747fc7436226774d1735775c2ef41c911d59b5d2/wmf-config%2FInitialiseSettings.php#L13828).
I basically just guesstimated and reindexed the ones that were too big into
more shards.
We have a script that creates a new index with the new configuration, copies
all the documents from the old index to the new one, and then swaps the
aliases (which we use for updates and queries) over to the new index. Then it
re-applies any updates or deletes that occurred after the copy started.
Having something like that is pretty common. I rarely use it to change
sharding configuration - it's much more common that I'll use it to change
how a field in the document is analyzed.
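The atomic alias swap at the heart of a reindex script like that can be sketched as follows. The index and alias names are made up for illustration, but the `_aliases` endpoint and its `actions` body are the actual Elasticsearch API; doing the remove and add in one request means queries never see a moment without a valid alias.

```python
import json

def alias_swap_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that atomically moves
    `alias` from `old_index` to `new_index`."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Hypothetical names: a versioned index pair behind one stable alias.
body = alias_swap_actions("wiki_content", "wiki_content_v1", "wiki_content_v2")
print(json.dumps(body, indent=2))
```

You'd send that body to POST /_aliases; once it returns, updates and queries through the alias hit the new index.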
Elasticsearch also has another way to handle this problem (we don't use it,
for unrelated reasons) where you create a single index for all customers and
then filter them at query time. You also add routing values to your
documents and queries so all documents from the same customer get routed to
the same shard. That way you can serve queries for a single customer out
of one shard which is pretty cool. For larger customers that don't fit on
a single shard you still create indexes just for them.
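A sketch of what such a routed, filtered query would look like, using a `filtered` query in the 1.x syntax. The index name `all_customers` and the field names are hypothetical; the `routing` query parameter and the term filter are the real mechanisms.

```python
from urllib.parse import urlencode

def customer_search_request(customer_id, text_query):
    """Build the path and body of a search that is routed to the
    customer's shard and filtered to that customer's documents."""
    path = "/all_customers/_search?" + urlencode({"routing": customer_id})
    body = {
        "query": {
            "filtered": {  # ES 1.x syntax; later versions use bool + filter
                "query": {"match": {"body": text_query}},
                "filter": {"term": {"customer_id": customer_id}},
            }
        }
    }
    return path, body
```

The same routing value must be supplied at index time so all of a customer's documents land on one shard; the filter is still required because other customers' documents can share that shard.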
One thing to watch out for, though, is that Elasticsearch doesn't use the
shard's size when determining where to place the shard. It'll check to
make sure the shard won't fill the disk beyond some percentage but it won't
try to spread out the large shards so you can get somewhat unbalanced disk
usage. I have an open pull request for something to do that, so this probably
won't be true forever, but it is true for now.
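The percentage check mentioned above is driven by the disk allocation watermarks, which you can tune through the cluster settings API; the values below are just examples, not recommendations.

```python
import json

# Transient cluster settings tightening the disk watermarks the
# allocator consults before placing a shard on a node.
disk_watermarks = {
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "80%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }
}

# Would be sent as: PUT /_cluster/settings with this JSON body.
print(json.dumps(disk_watermarks, indent=2))
```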
How big are your documents and how frequently do you think you'll need
shard splitting? If your documents are pretty small you may be able to get
away with just reindexing all of them for the customer when you need more
shards like I do. It sure isn't optimal but it gets the job done.
Another way to do things is once your customers get too big you create a
new index and route all of their new data there. You have to query both
indexes. This is kind of how people handle log messages and it might
work, depending on your use case.
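One way to sketch that split: point a read alias at both the old and the new index so queries span them, while the application writes new data straight to the new index. All the names here are hypothetical; the alias actions body is the real API.

```python
def split_customer_aliases(read_alias, old_index, new_index):
    """Build a POST /_aliases body adding `read_alias` to both
    indexes so a single query covers old and new data."""
    return {
        "actions": [
            {"add": {"index": old_index, "alias": read_alias}},
            {"add": {"index": new_index, "alias": read_alias}},
        ]
    }

# Queries then hit GET /customer_42/_search (the alias), while new
# documents are indexed into customer_42_002 directly.
actions = split_customer_aliases("customer_42", "customer_42_001", "customer_42_002")
```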
Nik
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd051yRH2AiG7ZsSPR_zD2a%3DMfaRcWFywyPfsfSPsyBf4Q%40mail.gmail.com.