Calculate optimal number of nodes

Hi,

I'm new to Elasticsearch and am trying to understand the basics. I want to
calculate the correct number of nodes for a given index size. Let's
start with an example:

The index size is 100 GB, number of replicas=2, number of shards does
not matter (I guess), so for optimal performance, there should be
100GB + 2x100GB = 300GB in memory.

If my servers have 32GB of RAM, I would need 10 of them, with roughly 30GB
each for Elasticsearch and 2GB for the operating system. Is this correct?

Half a year later, the index grows to 200GB. My options are either
a) to add another 10 servers with 32GB of ram or
b) replace the old 32GB-ram servers with 10 new 64GB ram servers.

From what I've read so far, I can't mix servers with different speeds
and different memory sizes, because the slowest one is always the bottleneck.
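The arithmetic above can be written out as a small sketch. It rests on the assumption made in this post, namely that every copy of the index has to fit in RAM; the function name and the 2GB operating-system reserve are only illustrative.

```python
import math

def nodes_needed(index_gb, replicas, ram_per_node_gb, os_reserve_gb=2):
    """Nodes required IF every copy of the index had to fit in RAM (assumption)."""
    total_gb = index_gb * (1 + replicas)          # primary plus replica copies
    usable_gb = ram_per_node_gb - os_reserve_gb   # RAM left for Elasticsearch
    return math.ceil(total_gb / usable_gb)

print(nodes_needed(100, 2, 32))  # 300GB over 30GB usable per node
print(nodes_needed(200, 2, 32))  # option a): keep 32GB servers
print(nodes_needed(200, 2, 64))  # option b): switch to 64GB servers
```

Under that assumption the 100GB case gives 10 nodes, and the 200GB case gives 20 nodes with 32GB servers or 10 nodes with 64GB servers.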

Thanks,

Jean

The full index is not held in memory, so you don't need to size memory by the size of the index. Memory usage is driven mainly by two things: the term index that is loaded into memory (a kind of skip list that makes searches faster; by default every 128th term is kept), and any sorting or faceting you do on fields. You can see the memory used by the latter with the node stats API.
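As a rough illustration of reading those numbers, here is a minimal sketch that pulls per-node heap and field data figures out of a node stats response. The endpoint path `/_nodes/stats` and the JSON field names are assumptions based on recent Elasticsearch releases; older versions expose the stats under different paths and names, so check the response your cluster actually returns. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen

def fetch_node_stats(base_url="http://localhost:9200"):
    # Assumed endpoint; older releases used a different path.
    with urlopen(base_url + "/_nodes/stats") as resp:
        return json.load(resp)

def summarize_memory(stats):
    """Return heap and field data usage per node from a node stats dict."""
    summary = {}
    for node_id, node in stats.get("nodes", {}).items():
        heap = node.get("jvm", {}).get("mem", {})
        fielddata = node.get("indices", {}).get("fielddata", {})
        # Field names below are assumptions; missing keys yield None.
        summary[node.get("name", node_id)] = {
            "heap_used_bytes": heap.get("heap_used_in_bytes"),
            "fielddata_bytes": fielddata.get("memory_size_in_bytes"),
        }
    return summary
```

Calling `summarize_memory(fetch_node_stats())` against a running cluster would give one entry per node; fields that are absent in your version come back as `None` rather than raising an error.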

On Tuesday, February 7, 2012 at 12:21 PM, jeangld@yahoo.com wrote:


Thanks for the fast reply. Which number should I look at in the node
stats? I always thought it was best to have the full index (without the
data file *.fdt) in memory.