When is one shard enough?

I maintain search indexes for a bunch of wikis (hundreds), many of which
are small (hundreds of documents) and some of which are pretty big (a few
million documents). I've noticed that having five shards per index for the
small wikis seems like overkill. It feels like the cluster is doing more
work maintaining index metadata for some wikis than it is doing actual
Lucene work. OTOH, my master node still has very low CPU usage. Does anyone
run with one or two shards for small indexes?

Caveats:

  1. Technically I split each wiki into two indexes to make the suggestion
    results better. The split ends up putting about 3/4 of the documents in
    one index and 1/4 in the other. In any case, that means even more shards
    with even less data in them for the small wikis.
  2. Update and search rate is mostly proportional to the number of
    documents.

Nik
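
For reference, the shard count is fixed per index when it is created, so going
from the default five shards down to one is just an index-settings decision at
creation time. A minimal sketch with the Python client (the index name and
settings are invented for illustration, not Nik's actual setup):

    # Minimal sketch: create a small index with a single primary shard.
    # Assumes elasticsearch-py and a node on localhost; the index name is
    # hypothetical.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.indices.create(
        index="smallwiki_content",
        body={
            "settings": {
                "number_of_shards": 1,      # default is 5
                "number_of_replicas": 1,
            }
        },
    )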


Why are you sharding in the first place? Unless you know for sure that it
will grow very big, I'd skip this altogether. Planning in advance to avoid
possible resharding later can easily become over-engineering.

Depending on usage, I'd keep each shard up to a dozen or two GBs. That's an
ideal size in my experience.

If you nevertheless want to go the virtual-shard route, try using a shard
key that leaves you with empty shards now and that you can bias towards
later (for example, by using dates).

HTH
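
For what it's worth, the custom routing hinted at here just means sending the
same routing value at index and search time, so matching searches only touch
the shard(s) that value hashes to. A hedged sketch, with the index name and
routing values invented (older client versions may also require a doc type):

    # Route documents by a date-derived key so that data can be biased toward
    # currently-empty shards later. All names and values here are invented.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.index(
        index="smallwiki_content",
        id="page-123",
        routing="2013-12",
        body={"title": "Some page", "text": "..."},
    )

    # A search that passes the same routing value only queries the shard(s)
    # that value routes to, instead of fanning out to all of them.
    es.search(
        index="smallwiki_content",
        routing="2013-12",
        body={"query": {"match": {"text": "shard"}}},
    )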


On Thu, Dec 12, 2013 at 4:26 PM, Itamar Syn-Hershko itamar@code972.com wrote:

> Why are you sharding in the first place? Unless you know for sure that it
> will grow very big, I'd skip this altogether. Planning in advance to avoid
> possible resharding later can easily become over-engineering.

I'll take that as a yes. I'm not sure I'd consider accepting the defaults to
be over-engineering, though.

> Depending on usage, I'd keep each shard up to a dozen or two GBs. That's
> an ideal size in my experience.

That seems useful. Do you base that recommendation on how easy it is to move
shards around?

By that logic the vast majority of the indexes I have now should be
squished to one shard. None of them grow that quickly and re-sharding
them later really isn't a problem.

> If you nevertheless want to go the virtual-shard route, try using a shard
> key that leaves you with empty shards now and that you can bias towards
> later (for example, by using dates).

I assume you mean custom routing. I can't really use that as there aren't
any good routing keys. None of the candidates break data into small enough
buckets.

If you mean indexing documents based on time or some other key, a la
Logstash, that kind of works for me and I already do it. I do it because it
improves the output of the suggesters, since each index has very different
term frequencies. I don't get to delete any of the indexes over time,
though, because they aren't time based.

Nik


inline

On Fri, Dec 13, 2013 at 12:20 AM, Nikolas Everett nik9000@gmail.com wrote:

> I'll take that as a yes. I'm not sure I'd consider accepting the defaults
> to be over-engineering, though.

I've heard Simon say more than once that some of those defaults are
actually pretty bad :)

> That seems useful. Do you base that recommendation on how easy it is to
> move shards around?

Mostly around what index size makes sense at the Lucene level, but obviously
that depends on your analyzers etc. Moving shards around starts getting too
costly at around 15GB. We've had a cluster that took days to stabilize, once
it became unstable, because of its larger indexes.
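
As a rough way to check which indexes are anywhere near that size, the cat
shards API lists per-shard store sizes; a small sketch (client setup repeated
so it stands alone):

    # Print index name, shard number, primary/replica flag, and on-disk size
    # for every shard, to spot indexes that are far below the ~15GB mark.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    print(es.cat.shards(h="index,shard,prirep,store", v=True))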

> I assume you mean custom routing. I can't really use that as there aren't
> any good routing keys. None of the candidates break data into small enough
> buckets.
>
> If you mean indexing documents based on time or some other key, a la
> Logstash, that kind of works for me and I already do it. I do it because it
> improves the output of the suggesters, since each index has very different
> term frequencies. I don't get to delete any of the indexes over time,
> though, because they aren't time based.

I meant routing. Unless you really handle lots of data, time-sliced indexing
is an alternative to sharding in most cases. Combining both (instead of
using aliases, for example) only makes sense if you have to use some
predefined slice size (like daily) and you have tons of data coming in.
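
For the time-sliced-plus-alias pattern mentioned above, the usual shape is one
index per slice with an alias spanning them for search; a hedged sketch with
invented index and alias names:

    # One single-shard index per time slice, plus an alias that searches
    # across all of them. Names are invented for illustration.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    for month in ["2013-11", "2013-12"]:
        es.indices.create(
            index="wiki-" + month,
            body={"settings": {"number_of_shards": 1}},
        )

    es.indices.update_aliases(body={
        "actions": [
            {"add": {"index": "wiki-2013-11", "alias": "wiki-search"}},
            {"add": {"index": "wiki-2013-12", "alias": "wiki-search"}},
        ]
    })

    # Searching the alias fans out to every index it covers.
    es.search(index="wiki-search", body={"query": {"match_all": {}}})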


inline

On Thu, Dec 12, 2013 at 5:30 PM, Itamar Syn-Hershko itamar@code972.com wrote:

> Mostly around what index size makes sense at the Lucene level, but
> obviously that depends on your analyzers etc. Moving shards around starts
> getting too costly at around 15GB. We've had a cluster that took days to
> stabilize, once it became unstable, because of its larger indexes.

Neat! Well, most wikis are getting squished to one shard then. What kind of
thing about my choice of analyzers would suggest needing larger or smaller
shards?

Do you have a sense as to how many shards is too many? I assume search and
indexing would speed up until you get to more shards than you have nodes.
After that, increasing shards would slow down search and indexing but keep
the shards to a more manageable size. I remember reading about people
complaining about performance when their index spanned thousands of shards
and they wanted to search them all. I won't get that big, but I could
imagine fifty or a hundred.

Thanks!

Nik


inline

On Fri, Dec 13, 2013 at 7:26 PM, Nikolas Everett nik9000@gmail.com wrote:

> Neat! Well, most wikis are getting squished to one shard then. What kind of
> thing about my choice of analyzers would suggest needing larger or smaller
> shards?

We use custom analyzers with stemming that double the number of terms
(multiple terms at the same position), and some do even more than that.
Think SynonymFilter and similar, or indexing the same data multiple times.
This bloats the index.
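
To make the "multiple terms at the same position" point concrete, here is a
hedged sketch of index settings with a synonym filter (analyzer name, filter
name, and synonyms are invented); each matching token emits extra terms at the
same position, which is one way the index grows beyond the raw document size:

    # Index settings with a synonym token filter. Each synonym adds terms at
    # the same position as the original token, bloating the index somewhat.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.indices.create(
        index="wiki_with_synonyms",
        body={
            "settings": {
                "number_of_shards": 1,
                "analysis": {
                    "filter": {
                        "my_synonyms": {
                            "type": "synonym",
                            "synonyms": ["fast, quick, rapid"],
                        }
                    },
                    "analyzer": {
                        "text_with_synonyms": {
                            "tokenizer": "standard",
                            "filter": ["lowercase", "my_synonyms"],
                        }
                    },
                },
            }
        },
    )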

> Do you have a sense as to how many shards is too many? I assume search and
> indexing would speed up until you get to more shards than you have nodes.
> After that, increasing shards would slow down search and indexing but keep
> the shards to a more manageable size. I remember reading about people
> complaining about performance when their index spanned thousands of shards
> and they wanted to search them all. I won't get that big, but I could
> imagine fifty or a hundred.

Not really, and I believe this is tightly coupled to your data and expected
set of queries. The ultimate idea behind sharding is to make it so a query
can execute on only some of the shards and still get full results, because
the other shards would return 0 results. The sharding key / function would
have to try to do that.

FWIW, I have indexed the English Wikipedia on my MacBook Air and it takes
about 28GB. It could probably be reduced further since I didn't use stemming
or ASCIIFoldingFilter. It's perfectly fine to have this as one shard
assuming the growth rate is moderate; just to be on the safe side you could
split it into 3 shards. To the best of my knowledge that's the largest wiki
there is, so that's your worst-case scenario.

Thanks!

Nik

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAPmjWd0Sd3cx2pZ%2B%2B%3DwqASRc_E%2BxmFZL3kqX0Zrb7HnRG06VVA%40mail.gmail.com
.

For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAHTr4Zte4Yn7SOjvOpSw2aZ3RHJjdiZnbysqqXO%2BVL0sJwUp5g%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.