Estimates of cluster size / hardware based on document count and size

Hi there,

Does anyone here have estimates of cluster size based on the number of documents of a certain size?

For example:

  • 50,000,000 x 1,024-byte documents (~47 GB) requires X servers with X hardware
  • 500,000,000 x 500-byte documents (~238 GB) requires X servers with X hardware

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl

http://goo.gl/Lt7BC


--

It's very difficult to provide estimates based only on the number and size
of indexed documents. For example, on systems with heavy search traffic, the
peak rate of search requests and the required latency might be the most
important factors determining the number of nodes the cluster needs. It's
also important to consider what types of queries will be used, because
different queries have very different memory requirements. The type of data
being indexed can also have a significant impact on index size. There are
simply too many factors affecting memory, disk and CPU requirements to give
a reasonable estimate from the information provided.

I am sure this is not the answer you hoped to receive, but I would suggest
taking a significant subset of your data and indexing and searching it with
requests and load similar to what you anticipate in production, while
watching index size, memory and CPU. Start with a couple of small nodes and
load them until you reach the breaking point, then scale your cluster
accordingly. Relying on any other estimate is likely to be misleading.
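
For anyone who wants a concrete starting point, here is a rough sketch of the
kind of test described above, assuming Python 3 with the `requests` package
against a single Elasticsearch node on localhost:9200. The index name,
document shape, batch size and totals are made-up placeholders to replace
with your own data and queries:

```python
# Rough load-test sketch: bulk-index a subset of synthetic documents, then
# print the on-disk index size and the latency of a sample query.
# Assumes a recent Elasticsearch that accepts typeless bulk actions.
import json
import time

import requests

ES = "http://localhost:9200"
INDEX = "sizing-test"      # hypothetical index name
BATCH = 1_000              # documents per bulk request
TOTAL = 100_000            # index a representative subset, not all 50M docs


def bulk_index():
    """Index TOTAL synthetic ~1 KB documents in batches of BATCH."""
    padding = "x" * 900    # pad each document to roughly 1 KB of source
    for start in range(0, TOTAL, BATCH):
        lines = []
        for i in range(start, start + BATCH):
            lines.append(json.dumps({"index": {"_index": INDEX, "_id": str(i)}}))
            lines.append(json.dumps({"doc_id": i, "payload": padding}))
        body = "\n".join(lines) + "\n"
        resp = requests.post(f"{ES}/_bulk", data=body,
                             headers={"Content-Type": "application/x-ndjson"})
        resp.raise_for_status()


def report():
    """Print the on-disk index size and the latency of one simple query."""
    requests.post(f"{ES}/{INDEX}/_refresh")
    stats = requests.get(f"{ES}/{INDEX}/_stats/store").json()
    size = stats["_all"]["primaries"]["store"]["size_in_bytes"]
    print(f"store size: {size / 1024 / 1024:.1f} MB for {TOTAL} docs")

    t0 = time.time()
    requests.get(f"{ES}/{INDEX}/_search",
                 params={"q": "doc_id:42", "size": 10}).raise_for_status()
    print(f"sample query latency: {(time.time() - t0) * 1000:.1f} ms")


if __name__ == "__main__":
    bulk_index()
    report()
```

While it runs, watch JVM heap and CPU with GET /_nodes/stats, then
extrapolate the per-document disk and memory cost to your full data set and
to the query rate you expect in production.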


--

Hi Igor,

Thank you for your response. I actually expected an answer like this, but I
had hoped that document size and count would give at least a rough estimate.
Of course the volume of searches, the types of queries and so on all matter
too. It all makes sense.

I'll just go with testing it; that's the only reliable way.

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl

http://goo.gl/Lt7BC


--

It would be great if you could post your benchmarking results and
information about your dataset. If more people do this, we will end up with
benchmarks for a variety of use cases.
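
As a small sketch of what is worth capturing when sharing such results (same
assumptions as the earlier snippet: a local node, the `requests` package, and
the placeholder index name), the standard stats APIs already contain most of
the interesting numbers:

```python
# Dump the stats endpoints that describe a benchmark run so they can be
# shared alongside a description of the dataset and the queries used.
import json

import requests

ES = "http://localhost:9200"

ENDPOINTS = {
    "cluster_stats": "/_cluster/stats",    # node count, shard count, JVM heap
    "nodes_stats": "/_nodes/stats",        # per-node memory, CPU and disk
    "index_stats": "/sizing-test/_stats",  # document count and store size
}

for name, path in ENDPOINTS.items():
    data = requests.get(ES + path).json()
    with open(f"{name}.json", "w") as out:
        json.dump(data, out, indent=2)
    print(f"wrote {name}.json")
```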


--