# of shards and # of data paths (striping)

Hi,

Has anyone compared indexing performance between having more shards and
having more data paths?

I'm asking because I wonder whether having more data paths might be more
cost-effective on EC2.

For example, consider two configurations (5 shards, 0 replicas for
simplicity):

A: one Standard Large instance with 5 EBS (8GB) volumes
B: 5 Standard Small instances, each with 1 EBS (8GB) volume
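
To make the two setups concrete, here is a rough sketch that renders the relevant `elasticsearch.yml` lines for each. It assumes `path.data` takes a comma-separated list of locations and that shard and replica counts are set through `index.number_of_shards` and `index.number_of_replicas`. The helper `render_es_config` and the `/mnt/ebsN` mount points are made up for illustration.

```python
def render_es_config(data_paths, shards=5, replicas=0):
    """Render a minimal elasticsearch.yml body as text (illustrative helper)."""
    if not data_paths:
        raise ValueError("at least one data path is required")
    lines = [
        # Several data locations are written as one comma-separated value.
        "path.data: " + ",".join(data_paths),
        f"index.number_of_shards: {shards}",
        f"index.number_of_replicas: {replicas}",
    ]
    return "\n".join(lines) + "\n"


# Configuration A: hypothetical mount points, one per EBS volume.
config_a = render_es_config([f"/mnt/ebs{i}" for i in range(1, 6)])

# Configuration B: every Small instance has a single volume.
config_b = render_es_config(["/mnt/ebs1"])

print(config_a)
print(config_b)
```

In B the same file would be deployed on each of the five Small instances, so the only per-node difference from A is the number of entries in `path.data`.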

A Standard Large is exactly 4 times as capable and 4 times as expensive as
a Standard Small. But A costs less, because one more EBS volume (4 more, to
be exact) costs less than one more Standard Small instance.
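
The cost comparison can be written out as a small calculation. This is only a sketch: the prices are placeholder units, not real AWS prices, and the only relationship taken from the post is that a Large costs 4 times a Small. The function `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(instances, instance_price, volumes, volume_price):
    """Total cost of a cluster: instances plus attached EBS volumes."""
    return instances * instance_price + volumes * volume_price


# Placeholder prices in arbitrary integer units (NOT actual AWS pricing).
SMALL = 100          # one Standard Small instance
LARGE = 4 * SMALL    # ratio stated in the post
VOLUME = 5           # one 8GB EBS volume, assumed far cheaper than an instance

# A: one Large with 5 volumes. B: five Smalls with 1 volume each.
cost_a = monthly_cost(1, LARGE, 5, VOLUME)
cost_b = monthly_cost(5, SMALL, 5, VOLUME)

print("A:", cost_a, "B:", cost_b, "difference:", cost_b - cost_a)
```

Since both setups end up with five volumes in total, the difference under these assumptions comes down to the price of the fifth Small instance.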

In a traditional RDBMS, fsync-heavy database operations benefit a LOT from
more volumes, so there it would be clear that A outperforms B.

But ES does as few fsyncs (commits) as possible, according to the video
"Road to a Distributed Search Engine".
So, how do the two configurations compare?

I would appreciate your thoughts and ideas.

Also, does it make sense to have a shared benchmark for ES? Or is there
one already?
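
As a starting point for such a benchmark, something like the following could time indexing throughput. It is a minimal sketch, not an existing ES tool: `measure_throughput` is a hypothetical helper, and `index_batch` stands for whatever function actually sends a batch of documents to the cluster (here replaced by an in-memory list so the snippet runs on its own).

```python
import time


def measure_throughput(index_batch, docs, batch_size=100, clock=time.perf_counter):
    """Index docs in batches and return documents per second for one run."""
    start = clock()
    for i in range(0, len(docs), batch_size):
        index_batch(docs[i:i + batch_size])
    elapsed = clock() - start
    # Guard against a zero interval on very fast stand-in indexers.
    return len(docs) / elapsed if elapsed > 0 else float("inf")


# Stand-in indexer: collects documents in memory instead of calling ES.
indexed = []
docs = [{"id": n, "body": "doc %d" % n} for n in range(1000)]

rate = measure_throughput(indexed.extend, docs)
print("indexed %d docs at %.0f docs/sec" % (len(indexed), rate))
```

Running the same document set and batch size against both configurations would at least make the numbers comparable between people.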

Thanks,
Takenori

Another thing to consider when running ES on AWS is the IO. AWS small
instances have low I/O performance, hence option B would likely not perform
well. Small instances are also more susceptible to noisy neighbor issues.
Just my 3.14 cents...

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

Hi Takenori,

Yes, I think this is more or less a known thing: more EBS volumes are
better than one because you never know when one of them will "have a
bad day".
Plus, small EC2 instances can have poor IO.

But all this is hard/impossible to measure precisely because you never know
what exactly you are measuring, who/what else is interfering with your
benchmarking, etc.
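
One way to live with that noise is to repeat each run several times and look at the spread rather than a single number. The sketch below assumes you already have a list of throughput figures from repeated runs. `summarize_runs` is a hypothetical helper and the sample values are invented.

```python
import statistics


def summarize_runs(rates):
    """Summarize repeated throughput measurements (docs/sec)."""
    if len(rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    stdev = statistics.stdev(rates)
    return {
        "median": statistics.median(rates),
        "stdev": stdev,
        # Coefficient of variation: spread relative to the mean.
        "cv": stdev / statistics.mean(rates),
    }


# Invented sample figures, only to show the shape of the output.
summary = summarize_runs([90.0, 100.0, 110.0])
print(summary)
```

If the coefficient of variation is large compared with the difference between two setups, the comparison probably says more about the neighbors than about the configuration.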

Otis

Performance Monitoring SaaS for Elasticsearch -
Sematext Monitoring | Infrastructure Monitoring Service

Thanks Berkay and Otis!

It may not have been obvious to me before, but Small instances sound almost
as unstable as Micro ones.

I will start evaluating a cluster of Large instances (with EBS volumes)
instead of Small ones.

In "Deploying Elasticsearch with Chef Solo", a Large instance is also
recommended for testing.

Best,
Takenori
