Setting up Elasticsearch on EC2 - instance size and number?

Hey,

I am about to index about 200 million records from a tab-delimited
file with 23-40 properties per line (most of them indexed). The data will
probably come to around 150GB as JSON.

Before I start, does anyone have a feel for what instance types, and how
many of them, they would go for (single-client indexing throughput only
right now)? Would 3 large instances (7.5GB memory) do, or would I be better
off with a bunch of smaller ones?

Cheers,
Tim

I think even two should be enough. Since you have a single client indexing,
the question is how you can parallelize it (even if it's in a single process,
consider using threads). I have a feeling that you might bottleneck on the
client side before you bottleneck on the elasticsearch side. If you see that
your client can push more than elasticsearch can handle, then it makes sense
to add another machine.

If you are using a large instance, make sure that you set the -Xmx parameter
to a higher value (by default it is -Xmx1g) so elasticsearch will make use of
more of the memory available on the machine.

-shay.banon
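
To make the threading suggestion above concrete, here is a minimal sketch of a parallel indexing client using only the JDK and elasticsearch's HTTP API. The host, index/type names, field names, thread count and in-flight bound are placeholder assumptions for illustration, not details from this thread:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ThreadedIndexer {

    // Placeholder host, index and type names -- adjust to your cluster.
    private static final String TARGET = "http://localhost:9200/records/record";
    // Illustrative field names for the first few tab-separated columns.
    private static final String[] FIELDS = {"id", "scientific_name", "country"};

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);   // tune to what the cluster absorbs
        Semaphore inFlight = new Semaphore(1000);                 // stop the reader outrunning the indexers

        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                final String[] cols = line.split("\t", -1);
                inFlight.acquire();
                pool.submit(() -> {
                    try {
                        index(toJson(cols));
                    } finally {
                        inFlight.release();
                    }
                });
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    // Build a flat JSON document from the tab-separated columns.
    private static String toJson(String[] cols) {
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < FIELDS.length && i < cols.length; i++) {
            if (i > 0) json.append(',');
            json.append('"').append(FIELDS[i]).append("\":\"")
                .append(cols[i].replace("\\", "\\\\").replace("\"", "\\\"")).append('"');
        }
        return json.append('}').toString();
    }

    // POST one document and let elasticsearch assign the id.
    private static void index(String json) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(TARGET).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes("UTF-8"));
            }
            if (conn.getResponseCode() >= 300) {
                System.err.println("Index failed: HTTP " + conn.getResponseCode());
            }
            conn.disconnect();
        } catch (Exception e) {
            e.printStackTrace();  // a real loader would retry or record failures
        }
    }
}
```

The thread count and in-flight bound are the knobs to turn while watching whether the client or the cluster saturates first, which is the bottleneck question raised above. On the server side, the -Xmx advice simply means starting the elasticsearch JVM with a larger heap than the 1GB default (say 4-6GB on a 7.5GB instance), leaving the rest of the RAM to the operating system's file cache.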


Thanks Shay, I'll try 2 and see how they perform.

I expect that for decent search response times I will need to get the
indexes into memory, so I anticipate more machines will be needed later.
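
(Some rough arithmetic on the numbers in this thread, for context: 150GB of source JSON over two or three 7.5GB-RAM instances is roughly 50-75GB of data per node against a few GB of usable heap each, so holding the entire index in memory would take far more nodes; at this size the more realistic aim is a larger heap plus the OS file cache keeping the hot parts of the index resident.)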


Do you mean storing the index in memory? It really depends on the filesystem
performance on Amazon, I guess, but on local (non-virtualized) disks you would
be surprised at the performance. If you get to compare the two, it would be
interesting to hear...

-shay.banon
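
The comparison asked for above could be as simple as timing the same handful of queries against an index on EBS-backed storage and one on local instance-store disks. A rough sketch using only the JDK and the URI search endpoint; the host, index name and sample terms are placeholders, not values from the thread:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Arrays;

public class SearchLatencyCheck {

    // Placeholder cluster and index; point this at each setup (EBS vs local disk) in turn.
    private static final String BASE = "http://localhost:9200/records/_search?q=";
    private static final String[] SAMPLE_TERMS = {"puma", "quercus", "denmark", "1998"};

    public static void main(String[] args) throws Exception {
        int runs = 200;
        long[] millis = new long[runs];
        for (int i = 0; i < runs; i++) {
            String term = SAMPLE_TERMS[i % SAMPLE_TERMS.length];
            long start = System.nanoTime();
            query(BASE + URLEncoder.encode(term, "UTF-8"));
            millis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        Arrays.sort(millis);
        System.out.printf("median=%dms p95=%dms max=%dms%n",
                millis[runs / 2], millis[(int) (runs * 0.95)], millis[runs - 1]);
    }

    // Run one URI search and drain the response body; only the elapsed time matters here.
    private static void query(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        byte[] buf = new byte[8192];
        try (InputStream in = conn.getInputStream()) {
            while (in.read(buf) != -1) {
                // discard the hits; we only measure latency
            }
        }
    }
}
```

Running this against both storage setups, and with a warm versus cold file cache, would give the local-disk-versus-virtualized-storage comparison in concrete numbers.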


Hi Tim,

I am very interested to learn how your experiment went (or is going). I'm
leading the development of an internal middleware solution which must work
both in a traditional hosted environment and on AWS/EC2. Being able to run
Elasticsearch on EC2 will help my technology selection.

Any info would be great.

Many thanks,

Paul.




+1 :)

Thanks,
Paolo

The short answer is that I got sidetracked... it is on the list of things
to do, and I will share all findings. I'm sure it will work; I am just
curious how expensive it becomes for decent throughput.
