Large Solr migration recommendations (3.6 billion+ docs)

We are considering migrating from Solr to ES and are planning our
server/shard setup now. I thought that perhaps the group has some
recommendations based on our current Solr environment. Currently we have:

4x dedicated Solr machines, each with an 8-core CPU, 96GB RAM, and
8x500GB SSDs in RAID 5.
Each Solr server has 10 cores (indexes), and each of those 10 cores is
sharded across the 4 servers.
Each individual core occupies about 200GB of disk space on average and
holds approximately 90 million documents.

This brings our total index space to 8TB and 3.6 billion documents. We
index about 10 million new documents each day.

So my question is: when designing a new ES deployment architecture, what
would be the suggested number of servers and number of shards? This is a
multi-tenant environment, so is it better to separate the tenants (2,000+)
by index or by document type? Also, the Solr index currently does NOT store
the document content, but we would like ES to, so our index size will
increase substantially.

Thank you in advance for your time.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

So my question is: when designing a new ES deployment architecture, what
would be the suggested number of servers and number of shards?

I'm afraid nobody can give you any specific numbers with real confidence
-- the infrastructure depends not only on the number of documents or the
index size, but also on the query types (how much sorting? how much
faceting? heavy queries?), query throughput, your desired response
latency, etc.

The usual advice to get some feeling and understanding of the requirements
is:

1/ Create a shielded, controlled, limited environment (CPU-wise,
IO-wise, RAM-wise), either a VM or a physical machine.
2/ Create an index with one single shard, and pour some documents based
on your real data into it.
3/ Fire some example queries based on your projected usage.
4/ Continue pouring documents into the index while performing the
queries, and stop when you feel or know it's too much for the single
shard on this infrastructure to handle.
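The pour-and-query loop might be sketched like this; everything here is an assumption for illustration (a local test node on port 9200, an index named capacity-test, pre-split bulk files under batches/, and a placeholder match query):

```shell
#!/bin/sh
# 2/ One index, exactly one shard, no replicas (all names are made up).
curl -s -XPUT 'http://localhost:9200/capacity-test' -d '
{ "settings": { "number_of_shards": 1, "number_of_replicas": 0 } }'

# 4/ Pour documents batch by batch, timing a representative query after
#    each batch; stop when the reported "took" time is no longer acceptable.
for batch in batches/*.json; do
  curl -s -XPOST 'http://localhost:9200/capacity-test/_bulk' \
       --data-binary @"$batch" > /dev/null
  curl -s -XPOST 'http://localhost:9200/capacity-test/_refresh' > /dev/null
  took=$(curl -s -XGET 'http://localhost:9200/capacity-test/_search' -d '
  { "query": { "match": { "body": "placeholder query terms" } } }' \
    | grep -o '"took":[0-9]*')
  echo "$batch -> $took"
done
```

Watching how "took" grows with document count gives you the per-shard ceiling the steps above are after.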

Now you have an approximate number of docs (and index size) that one
single shard can handle. Next, estimate how many of these shards a real
machine can handle. From that point on, everything is simply a matter of
multiplication :)
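Once you have those two numbers, the multiplication looks like this (the per-shard and per-node capacities below are invented placeholders, not recommendations):

```shell
# Hypothetical test results: a single shard handled ~120M docs acceptably,
# and one machine comfortably ran 4 such shards.
DOCS_TOTAL=3600000000      # current corpus, from the original post
DOCS_PER_SHARD=120000000   # assumed single-shard ceiling from the test
SHARDS_PER_NODE=4          # assumed per-machine shard capacity

# Ceiling division for the shard count, then for the node count.
SHARDS=$(( (DOCS_TOTAL + DOCS_PER_SHARD - 1) / DOCS_PER_SHARD ))
NODES=$(( (SHARDS + SHARDS_PER_NODE - 1) / SHARDS_PER_NODE ))
echo "shards=$SHARDS nodes=$NODES"   # shards=30 nodes=8
```

Leave headroom on top of that for the ~10 million new documents per day.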

This is a multi-tenant environment so is it better to separate the
tenants (2,000+) by index or by document type?

If it makes sense for you to query a specific portion of the corpus based
on "account ID", definitely have a look at the "routing" and "index
aliases" features. Shay talks about some concrete strategies in this
presentation: http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html
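A common way to combine the two features is a filtered, routed alias per tenant, so each tenant's writes land on one shard and their searches only hit that shard (the index name "documents", alias "tenant_42", and field "account_id" below are invented for illustration):

```shell
# One alias per tenant: "routing" pins the tenant to a single shard,
# "filter" scopes every search through the alias to that tenant's docs.
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
  "actions": [
    { "add": {
        "index":   "documents",
        "alias":   "tenant_42",
        "routing": "42",
        "filter":  { "term": { "account_id": "42" } }
    } }
  ]
}'

# The application then simply indexes and searches against the alias:
curl -XGET 'http://localhost:9200/tenant_42/_search?q=hello'
```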

Also, the solr index is currently NOT storing the document content, but
we would like ES to so our index size will increase substantially.

To save disk space when you'd like to store and serve the documents
themselves, consider using the compress option
(http://www.elasticsearch.org/guide/reference/mapping/source-field.html).

Karel


Thank you for your response.

Routing looks very good for our situation, yes. We can effectively limit
each query to a single shard quite easily in this manner. Given this, query
performance will not be affected by the number of shards directly as no
query will span multiple shards. However, there has to be some overhead for
each additional shard. Is there some idea of what this overhead is? Would
it be reasonable to have 100 shards? 1000 shards?

I set up a simple 3-node cluster and initialized a new index with 1000
shards and 1 replica, and it didn't take long to go green.


However, there has to be some overhead for each additional shard. Is there some idea of what this overhead is? Would it be reasonable to have 100 shards? 1000 shards?

Depends on how many nodes you have to put all those shards on :)

I set up a simple 3 node cluster and initialized a new index with 1000 shards and 1 replica and it didn't take long to go green.

That's one way to stress-test it. But a more reasonable way is the "overload one shard" technique I described, because "scaling out" just means placing all those shards (= Lucene indices) somewhere. Given that you're not even hitting all or many of them with each query, it plays very well into your cards.

Karel


Hi,

I would think the requirements would be roughly the same for Solr and ES.
If you are comparing and doing performance testing, hardware capacity
planning, and such, you may find SPM for ES and Solr helpful:
http://sematext.com/spm/index.html

5+ years ago I ran a site that had >300K Lucene indices on a single
dual-core server with 8 GB of RAM. The indices would get closed when not
in use, but crawlers were allowed to crawl the site and triggered all kinds
of queries, so you can imagine what was happening. So I would consider the
index-per-tenant approach first. Oh, and I know one of the SPM users has a
15-node ES cluster with close to 4000 indices and I don't know how many
shards. I don't think their servers are better than yours.
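In Elasticsearch the equivalent trick is available through the indices open/close API, which frees the resources of idle tenant indices and reopens them on demand (the index name here is a made-up example):

```shell
# Close an idle tenant's index; it stays on disk but serves no queries
# and releases its in-memory resources.
curl -XPOST 'http://localhost:9200/tenant-1234/_close'

# Reopen it when that tenant becomes active again.
curl -XPOST 'http://localhost:9200/tenant-1234/_open'
```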

Otis

Solr & ElasticSearch Support


Index per tenant would be the simplest solution. That was my original
thought; I was just concerned about the overhead of each additional
index. There are also other logistical concerns, such as whether the
elasticsearch-head plugin will handle displaying 2,000+ indexes. I think
I'll start with this approach and run some tests to find out, and I'll
report back here for the benefit of the group.

Thanks.


In addition to what was mentioned above about setting "_source" : {
"compress" : true }, you should also consider setting "_all" : {
"enabled" : false } to save even more space.
