Improving Bulk Indexing

IronMike · February 2, 2014, 1:26am

I would appreciate some tips and other people's perspectives on bulk
indexing, since I am new to this.

The end goal is to index 10 to 20 million documents. I started working
on my local machine with a sample of about 100 MB and the default
Elasticsearch configuration.
First, I encountered a heap error, which I believe was due to the default
JVM limit of 64 MB.

My documents vary in size; some are very large and others are very small.

1- I read that others choose some number of files to "chunk". That is
unpredictable in my case, so should I chunk based on data size instead? For
instance, right now I keep track of the size, and on my local machine I
send an index request when the total queued data reaches about 64 MB.
Should I apply the same concept in production? In other words, if the JVM
heap is 2 GB, do I index every time my queued data is close to 2 GB?

2- What do others recommend for settings and hardware or any other tips
that I can use to make this optimal?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6f111fc7-0e92-44ba-b2f7-ddcb98095fb1%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · February 2, 2014, 3:23pm

What is "the default of JVM 64 MB limit"? Elasticsearch uses a 1 GB heap
by default, not 64 MB. Maybe you have an extra JVM with your bulk client that
uses 64 MB? That is far too little. Use a 4-6 GB heap if your machine allows
that.

Note, JVM 7 of OpenJDK/Oracle, which is recommended, uses 25% of your host
RAM by default for your heap, not 64 MB.

You can use the BulkProcessor in the Java API, which also has a volume
limit per chunk as an alternative to a document count; the default is 5 MB.
64 MB is a very large bulk size. Bulk sizes of ~2 GB are very bad, since they
thrash the whole heap on the ES nodes, which induces severe GC problems and
delays. I recommend 1-10 MB, so each bulk responds within 1 second and GC is
very fast. You can run bulks concurrently to increase speed. To find the
sweet spot for your client/server situation, you have to experiment with
your setup: choose 1 MB and 1 concurrent thread, then 2 MB and 1 concurrent
thread, 2 MB / 2 threads, etc., until you see rates declining. ES has some
internal settings that prevent an overrun of the whole cluster.

Most important is to set replicas to 0 to make room for better
performance while bulk indexing, and to disable refresh by changing the
refresh interval from its default of 1 sec to -1. After the bulk load,
re-enable refresh, optimize, and add replicas. There are other, more
advanced knobs like throttling at store level or thread pool or queue
sizes, but changing the defaults does not influence bulk performance
that much.

Jörg
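
A minimal sketch of chunking by volume rather than by document count, as recommended above. It is plain Java with no Elasticsearch dependency: the class name SizeBoundedBatcher and the in-memory list of flushed batches are illustrative assumptions, and a real feeder would send a bulk request at the point where flush() stores the batch. The BulkProcessor in the Java API already does this for you; the sketch only shows the idea.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Groups documents into batches capped by total UTF-8 size, not by document count. */
class SizeBoundedBatcher {
    private final long maxBatchBytes;
    private final List<String> current = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>();
    private long currentBytes = 0;

    SizeBoundedBatcher(long maxBatchBytes) {
        this.maxBatchBytes = maxBatchBytes;
    }

    /** Adds one document; flushes first if it would push the pending batch past the cap. */
    void add(String jsonDoc) {
        long size = jsonDoc.getBytes(StandardCharsets.UTF_8).length;
        if (!current.isEmpty() && currentBytes + size > maxBatchBytes) {
            flush();
        }
        // A single document larger than the cap still goes out, as a batch of its own.
        current.add(jsonDoc);
        currentBytes += size;
    }

    /** Hands over the pending batch; a real feeder would send a bulk request here. */
    void flush() {
        if (current.isEmpty()) {
            return;
        }
        flushed.add(new ArrayList<>(current));
        current.clear();
        currentBytes = 0;
    }

    List<List<String>> batches() {
        return flushed;
    }
}
```

For the 1-10 MB range suggested above, the cap would be something like 5L * 1024 * 1024; call flush() once more after the last document so the final partial batch is not lost.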


IronMike · February 3, 2014, 2:36pm

Jörg,

Thanks for the tips. I meant 64 MB for the chunk volume, not the heap size
(sorry). I thought that was normal: I was weighing bigger chunks and fewer
index transactions against smaller chunks and more index transactions, and I
assumed that indexing smaller chunks would take a lot longer for millions of
documents. I will certainly try the tips you suggested and let you know. I
am using one node/one shard and replicas = 0 for now. I am using the native
Java API; do you know where I can set the refresh interval to -1?

Thanks


jprante · February 3, 2014, 2:47pm

Note, bulk operates just on network transport level, not on index level
(there are no transactions or chunks). Bulk saves network roundtrips, while
the execution of index operations is essentially the same as if you
transferred the operations one by one.

To change refresh interval to -1, use an update settings request like this:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html

    ImmutableSettings.Builder settingsBuilder =
            ImmutableSettings.settingsBuilder();
    // -1 disables the periodic refresh for the duration of the bulk load
    settingsBuilder.put("refresh_interval", "-1");
    UpdateSettingsRequest updateSettingsRequest =
            new UpdateSettingsRequest(myIndexName)
                    .settings(settingsBuilder);
    client.admin().indices()
            .updateSettings(updateSettingsRequest)
            .actionGet();

Jörg


IronMike · February 3, 2014, 4:06pm

Jörg,

Just so I understand this: if I were to index 100 MB worth of data in total
with chunk volumes of 5 MB each, I would have to index 20 times. If I
were to set the bulk size to 20 MB, I would have to index 5 times.
This is a small data size; picture millions of documents. Are you
saying the first method is better because GC operations would be faster?

Thanks again


jprante · February 3, 2014, 7:02pm

Not sure if I understand.

If I had to index a pile of documents, say 15M, I would build bulk requests
of 1000 documents each, where each doc is ~1 KB on average, so I end up at
~1 MB per request. I would not care about differing doc sizes, as they even
out over the total amount. Then I send this bulk request over the wire. With
a threaded bulk feeder, I can control the number of concurrent bulk requests,
up to the number of CPU cores, say 32 cores. Then repeat. In total, I send
15K bulk requests.

The effect is that on the ES cluster, each bulk request of 1 MB size
allocates only few resources on the heap and can be processed fast. If the
cluster is slow, the client sees the ongoing bulk requests piling up before
bulk responses are returned, and can cap bulk capacity with a maximum
concurrency limit. If the cluster is fast, the client receives responses
almost instantly, and the client can decide if it is more appropriate to
increase bulk request size or concurrency.

Does it make sense?

Jörg
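
A sketch of the concurrency limit described above, in plain Java with no Elasticsearch dependency. The class name BoundedBulkFeeder and the sender callback are hypothetical stand-ins for the real bulk call; the point is only that submit() blocks once the maximum number of bulk requests is still awaiting a response, so a slow cluster throttles the client instead of letting requests pile up.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Sends batches concurrently, but never more than maxInFlight at once. */
class BoundedBulkFeeder {
    private final Semaphore permits;
    private final ExecutorService pool;
    private final Consumer<List<String>> sender; // hypothetical stand-in for the bulk call
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peakInFlight = new AtomicInteger();

    BoundedBulkFeeder(int maxInFlight, Consumer<List<String>> sender) {
        this.permits = new Semaphore(maxInFlight);
        this.pool = Executors.newFixedThreadPool(maxInFlight);
        this.sender = sender;
    }

    /** Blocks the caller while maxInFlight batches are still awaiting a response. */
    void submit(List<String> batch) {
        permits.acquireUninterruptibly();
        pool.execute(() -> {
            int now = inFlight.incrementAndGet();
            peakInFlight.accumulateAndGet(now, Math::max);
            try {
                sender.accept(batch);
            } finally {
                inFlight.decrementAndGet();
                permits.release();
            }
        });
    }

    /** Waits for outstanding batches and returns the highest concurrency observed. */
    int close() {
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peakInFlight.get();
    }
}
```

With 15M documents in batches of 1000, the feeder would see 15,000 calls to submit(); raising maxInFlight or the batch size step by step is one way to run the experiment suggested earlier in the thread.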


IronMike · February 4, 2014, 2:40am

Thanks again for clarifying this; I think I understand. What I was
referring to in my prior posts was the difference between batches of 1000
documents and 10000 documents. I was thinking that a bigger chunk volume
would produce fewer over-the-wire index requests, but I understand your
reasoning about thrashing and slow GC. The numbers below "kind of" support
my theory: as I increased the chunk to 10 MB or 10,000 docs, I saw a slight
improvement in total indexing time (I think).
I would like to get your and others' feedback on some benchmarks. I
tested with BulkRequest and with BulkProcessor, with similar results (which
seem slow to me).

Same source for all tests (85 MB), running one node / 1 shard / 0 replicas
on a local MacBook with 8 cores and 4 GB RAM.

Bulk sized by volume:

    Batch size    concurrentRequests    Time to index 85 MB
    1 MB          1                     ~17 seconds
    1 MB          8                     ~15 seconds
    5 MB          1                     ~15 seconds
    5 MB          8                     ~17 seconds
    10 MB         1                     ~13 seconds
    10 MB         8                     ~13 seconds

Bulk sized by number of docs:

    Batch size    concurrentRequests    Time to index 85 MB
    1000 docs     1                     ~15 seconds
    1000 docs     8                     ~13 seconds
    10000 docs    1                     ~15 seconds
    10000 docs    8                     ~12 to ~13 seconds

OK, so an average of 15 sec for 85 MB, or about 5.5 MB/sec. Why do I think
this is slow? I am not sure if I am doing the math right, but for 20 million
docs (27 TB data), will this take 2 days?
I understand that with better machines (SSD, more RAM) I will get better
results. However, I would like to optimize what I have now to the fullest
before scaling up. What other configurations can I tweak to improve my
current test?

    .put("client.transport.sniff", true)
    .put("refresh_interval", "-1")
    .put("number_of_shards", 1)
    .put("number_of_replicas", "0")
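
As a rough check on the time estimate above, here is a small sketch of the arithmetic. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and a constant sustained rate, which a real bulk load will not hold exactly; the class and method names are illustrative only.

```java
/** Back-of-the-envelope estimate of bulk indexing duration at a constant rate. */
class IndexingEstimate {
    /** Days needed to push totalBytes at megabytesPerSecond (1 MB = 10^6 bytes). */
    static double daysToIndex(double totalBytes, double megabytesPerSecond) {
        double seconds = totalBytes / (megabytesPerSecond * 1_000_000.0);
        return seconds / 86_400.0; // 86,400 seconds per day
    }
}
```

With the figures from the post, daysToIndex(27e12, 5.5) comes out at roughly 57 days, while 2 days at 5.5 MB/sec corresponds to about 1 TB, so either the data volume or the duration in the estimate is worth rechecking.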


jprante · February 4, 2014, 8:11am

SSD will improve overall performance very much, yes. Disk drives are the
slowest part of the chain, so this will help. No more low IOPS, so it will
significantly reduce the load on the CPU (fewer I/O waits).

More RAM will not help that much. In fact, more RAM will slow down
persisting, because it increases pressure on the memory-to-disk part. ES
does not depend on large RAM for persisting data, some MB suffice, but you
can try and see for yourself.

85 MB is not sufficient for testing index segment merging and GC effects.
You should run a bulk indexing feed not for seconds, but for at least 20-30
minutes, if not for hours.

Also check if your mapping can be simplified: the less complex the
analyzers, the faster ES can index.

You should also measure how long your feed program takes to process
your input without the bulk indexing part. That gives you a baseline,
and maybe reveals more room for improvement outside ES.

In my use case, it helped to move the feed program to another server and
use the TransportClient, with a speedup of ~30%.

I agree that 5.5 MB/sec is not the end of the line, but that heavily depends
on your hardware and software configuration (machine, OS, file systems, JVM).

Jörg
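
A sketch of the baseline measurement suggested above: time only the transform step of the feed program, with no bulk indexing involved, and express it in MB/sec so it can be compared with the indexing rate. The class name FeedBaseline and the transform function are hypothetical placeholders for whatever parsing or conversion the feed program actually does.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;

/** Measures feed-side throughput without sending anything to Elasticsearch. */
class FeedBaseline {
    /** Converts a byte count and elapsed nanoseconds to MB/sec (1 MB = 10^6 bytes). */
    static double megabytesPerSecond(long bytes, long elapsedNanos) {
        return (bytes / 1_000_000.0) / (elapsedNanos / 1_000_000_000.0);
    }

    /** Runs only the transform step over the inputs and returns its throughput. */
    static double measure(List<String> rawInputs, Function<String, String> transform) {
        long bytes = 0;
        long start = System.nanoTime();
        for (String raw : rawInputs) {
            String doc = transform.apply(raw);
            bytes += doc.getBytes(StandardCharsets.UTF_8).length;
        }
        // Guard against a zero interval on very small inputs.
        long elapsed = Math.max(1, System.nanoTime() - start);
        return megabytesPerSecond(bytes, elapsed);
    }
}
```

If this baseline is close to the rate seen with indexing enabled, the feed program rather than ES is the bottleneck. Like the indexing test itself, it should run over far more than 85 MB to give a stable number.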


IronMike · February 4, 2014, 4:22pm

Jörg,

Great, I learned a lot about the process from your responses. Could you
elaborate on your use case? I think mine will be similar to yours:
processing/feeding will be on one server using the transport client, and
index nodes will be on EC2. So, when I do get to setting up EC2 nodes, I
believe I should mostly be looking for big cores and SSD.
For the current test, besides running long feeds to gauge performance and
checking the analyzers, I take it there isn't much else I can do to make a
significant impact?


jprante · February 4, 2014, 9:22pm

My use case is bibliographic data indexing for academic and public
libraries. There are ~100 million records from various sources that I
regularly extract, transform into JSON-LD, and load into Elasticsearch. Some
are files, some are fetched by JDBC. I have six 32-core servers at our
place, organized in 2 ES clusters. Self-installed and configured, no cloud
VMs. With bulk indexing I can push around 10-12 MB/sec to an ES cluster.
Transforming docs is rather complex and needs re-processing of indexed data.
The job is done in a few hours, so I can perform ETL every night. No SSD,
too expensive, but SAS-2 (6 Gbit/sec) RAID-0 drives of ~1 TB per server.

Jörg


IronMike · February 4, 2014, 11:57pm

Good to know, I will keep this in mind, even though I will try to go for
SSD, as I personally had great success with them in the past! When you say
10-12 MB/sec, is this with doc parsing/processing or just ES index time?
For my humble test on a quad-core laptop, I am pushing 6 MB/sec with
processing and 9 MB/sec if I don't include processing time. I tried playing
with many different settings; I think this is about all it is going to do
given the machine I am running on.


jprante · February 5, 2014, 12:15am

SSD is the best you can do for the persistence layer. I have such an ES
server at home with 4 SSDs in RAID-0, with an 800 MB/sec sustained write I/O
rate. My servers for my day job are some years old, from a time when a few
TB of SSD cost a fortune.

The higher the write rate and IOPS capacity of the drives, the more
throughput you can expect. Ramp up your monitoring tools, run bulk indexing
for an hour, and watch the segment merging; then you will understand how
bulk indexing behaves. With slow drives, you will see drops in the bulk
indexing rate; with fast drives, mostly not.

The 10-12 MB/sec sustained rate includes transforming docs on a single
remote server and sending them to a cluster of 3 nodes, using some dozens of
threads. I'm pretty sure the ETL process is CPU bound; there is still
network bandwidth available.

Jörg

