Has anyone measured indexing performance with more shards versus more
data paths?
I ask because I wonder whether having more data paths might be more
cost-effective on EC2.
For example, consider two configurations (5 shards, 0 replicas for
simplicity):
A: 1 Standard Large instance with 5 EBS (8 GB) volumes
B: 5 Standard Small instances, each with 1 EBS (8 GB) volume
A Standard Large is exactly 4 times better than, and 4 times as expensive
as, a Standard Small.
But A costs less, because an additional EBS volume (4 more, to be exact)
costs less than an additional Standard Small instance.
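To make the cost comparison above concrete, here is a rough sketch of the arithmetic. All prices are hypothetical placeholders, not real AWS quotes, and the `monthly_cost` helper is made up for illustration. It counts only instance hours and provisioned EBS storage, and ignores any per-request I/O or data-transfer charges.

```python
def monthly_cost(instances, instance_hourly, volumes, volume_gb,
                 gb_month_price, hours=730):
    """Instance hours plus provisioned EBS storage for one month."""
    compute = instances * instance_hourly * hours
    storage = volumes * volume_gb * gb_month_price
    return compute + storage

# Hypothetical prices (assumptions): substitute the current figures.
SMALL_HOURLY = 0.08
LARGE_HOURLY = 4 * SMALL_HOURLY  # the "exactly 4 times" premise from the post
EBS_GB_MONTH = 0.10

# A: 1 Large with 5 volumes; B: 5 Small with 1 volume each (5 in total).
cost_a = monthly_cost(1, LARGE_HOURLY, 5, 8, EBS_GB_MONTH)
cost_b = monthly_cost(5, SMALL_HOURLY, 5, 8, EBS_GB_MONTH)

print(f"A: {cost_a:.2f} per month, B: {cost_b:.2f} per month")
```

Under these assumptions the storage cost is the same for both setups, so the whole difference is one Small instance-month of compute.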
In a traditional RDBMS, fsync-heavy database operations benefit a LOT from
more volumes.
So there it would be clear that A outperforms B.
But ES does as few fsyncs (commits) as possible, according to the video
"Road to a Distributed Search Engine".
So, how do the two configurations compare?
I would appreciate your thoughts and ideas.
Also, does it make sense to have a shared benchmark for ES? Or is there
one already?
Another thing to consider when running ES on AWS is I/O. AWS Small
instances have low I/O performance, so option B would likely not perform
well. Small instances are also more susceptible to noisy-neighbor issues.
Just my 3.14 cents...
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype
Yes, I think this is more or less a known thing: more EBS volumes are
better than one, because you never know when one of them will "have a
bad day".
Plus, small EC2 instances can have poor I/O.
But all of this is hard, if not impossible, to measure precisely, because
you never know exactly what you are measuring, who or what else is
interfering with your benchmark, etc.
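One way to cope with that noise, at least partly, is to repeat the same indexing run several times and look at the median and the spread rather than a single number. The sketch below assumes a caller-supplied `index_batch` callable that indexes a fixed number of documents. The function names are hypothetical, and nothing here talks to Elasticsearch itself.

```python
import statistics
import time

def measure_throughput(index_batch, docs_per_batch, trials=5):
    """Call index_batch() `trials` times; return docs/sec for each run."""
    rates = []
    for _ in range(trials):
        start = time.perf_counter()
        index_batch()
        # Guard against a zero reading on a very coarse clock.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(docs_per_batch / elapsed)
    return rates

def summarize(rates):
    """Median, range, and coefficient of variation of the per-run rates."""
    return {
        "median": statistics.median(rates),
        "min": min(rates),
        "max": max(rates),
        "cv": statistics.pstdev(rates) / statistics.mean(rates),
    }
```

A large `cv`, or a wide gap between `min` and `max`, suggests that something other than the configuration under test was interfering, so runs on A and B should only be compared when both are reasonably stable.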
Otis
On Thursday, April 19, 2012 7:52:36 AM UTC-4, Sato Takenori wrote:
[original question, quoted in full at the top of the thread]
It may not be obvious, but Small instances sound almost as unstable as
Micro instances.
I will start evaluating a cluster of Large instances (with EBS volumes)
instead of Small ones.
"Deploying Elasticsearch with Chef Solo" also recommends Large instances
for testing.
Best,
Takenori
On Friday, April 20, 2012 4:40:19 AM UTC+9, Otis Gospodnetic wrote:
[reply quoted in full above]