I am about to index about 200 million records from a tab-delimited
file with 23-40 properties per line (most of them indexed). The data
will probably come to around 150GB as JSON.
Before I start, does anyone have a feel for what instance types, and
how many of them, they would guess at (single-client throughput only right now)?
Would 3 large instances (7.5GB memory) do me, or would I be better off with
a bunch of smaller ones?
I think even two should be enough. Since you have a single client indexing,
the question is how you can parallelize it (even if it is in a single process,
consider using threads; a sketch follows below). I have a feeling that you might
bottleneck on the client side before you bottleneck on the elasticsearch side.
If you see that your client can push more than elasticsearch can handle, then
it makes sense to add another machine.
If you are using a large instance, make sure that you set the -Xmx parameter
to a higher value (by default it is -Xmx1g) so elasticsearch can make use
of more of the memory available on the machine.
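A minimal sketch of one way to do that from the client side, assuming the Python elasticsearch client and its bulk helpers; the endpoint, index name, field-naming scheme, and file path are made up for illustration, and older servers may also expect a _type on each action:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

def actions(tsv_path):
    """Turn each tab-delimited line into a bulk index action."""
    with open(tsv_path) as f:
        for line in f:
            values = line.rstrip("\n").split("\t")
            # Field names are hypothetical; map them to the real 23-40 properties.
            doc = {"field_%d" % i: v for i, v in enumerate(values)}
            yield {"_index": "records", "_source": doc}

# parallel_bulk batches documents and sends them from a pool of threads,
# so a single client process keeps several bulk requests in flight.
for ok, item in helpers.parallel_bulk(es, actions("records.tsv"),
                                      thread_count=4, chunk_size=1000):
    if not ok:
        print("failed:", item)

Raising thread_count or chunk_size until the cluster stops keeping up is a quick way to see whether the bottleneck is the client or elasticsearch.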
Do you mean storing the index in memory? It really depends on the FS
performance of Amazon, I guess, but on local disks (not virtualized) you
will be surprised at the performance. If you get to compare the two, it would be
interesting to hear the results...
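As a point of reference, a hedged sketch of how a store type can be chosen per index with the Python client; the index name is illustrative, and which store types are available (including whether a pure in-memory store exists at all) depends on the elasticsearch version:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

# index.store.type controls how the Lucene index is stored; "niofs" and
# "mmapfs" are filesystem-backed choices, and only very old releases also
# offered an in-memory store. The value below is illustrative, not a recommendation.
es.indices.create(
    index="records",
    body={"settings": {"index": {"store": {"type": "niofs"}}}},
)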
I am very interested to learn how your experiment went/is going. I'm leading
the development of an internal middleware solution which must work both in a
traditional hosted environment and AWS/EC2. Being able to run Elastic Search
on EC2 will help my tech selection efforts.
The short answer is I got sidetracked... it is on the list of things
to do, and I will share all my findings.
For sure it will work; I am just curious how expensive it gets for decent throughput.