Thanks again for clarifying this; I think I understand now. What I was
referring to in my prior posts was the difference between setting 1000
documents vs 10000 documents per bulk request. My thinking was that a bigger
chunk volume produces fewer index requests over the wire, but I understand
your reasoning about thrashing and slow GC. The numbers below "kind of"
support my theory: as I increased the chunk to 10 MB or 10,000 docs, I saw a
slight improvement in total indexing time (I think).
I would like to get your/others' feedback on some numbers/benchmarks. I
tested with BulkRequest and with BulkProcessor, and both gave similar
results (which seem slow to me?).
- Same source for testing (85 MB)
- Running one node / 1 shard / 0 replicas on a local MacBook, 8 cores, 4 GB RAM
- Used Bulk batch size 1MB & concurrentRequests = 1, I indexed 85 MB in
~17 seconds.
- Used Bulk batch size 1MB & concurrentRequests = 8, I indexed 85 MB in
~15 seconds.
- Used Bulk batch size 5MB & concurrentRequests = 1, I indexed 85 MB in
~15 seconds.
- Used Bulk batch size 5MB & concurrentRequests = 8, I indexed 85 MB in
~17 seconds.
- Used Bulk batch size 10MB & concurrentRequests = 1, I indexed 85 MB in
~13 seconds.
- Used Bulk batch size 10MB & concurrentRequests = 8, I indexed 85 MB in
~13 seconds.
----------------------------- Using number of docs
- Used Bulk 1000 docs & concurrentRequests = 1, I indexed 85 MB in ~15
seconds.
- Used Bulk 1000 docs & concurrentRequests = 8, I indexed 85 MB in ~13
seconds.
- Used Bulk 10000 docs & concurrentRequests = 1, I indexed 85 MB in ~15
seconds.
- Used Bulk 10000 docs & concurrentRequests = 8, I indexed 85 MB in
~12/~13 seconds.
OK, so that is an average of about 15 sec for 85 MB, or roughly 5.5 MB/sec.
Why do I think this is slow? I am not sure if I am doing the math right, but
for 20 million docs (27 TB of data), will this take 2 days?
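To sanity-check that estimate, here is a minimal sketch of the arithmetic. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and that the rate measured on 85 MB holds linearly at scale, which is only an assumption. The class and method names are made up for illustration.

```java
class IndexingEstimate {
    static final double MB = 1e6;   // decimal megabyte (assumption)
    static final double TB = 1e12;  // decimal terabyte (assumption)

    /** Measured throughput in bytes per second. */
    static double throughputBytesPerSec(double bytesIndexed, double seconds) {
        return bytesIndexed / seconds;
    }

    /** Days needed to index totalBytes at a constant rate. */
    static double estimateDays(double totalBytes, double bytesPerSec) {
        return totalBytes / bytesPerSec / 86_400.0; // 86,400 seconds per day
    }

    public static void main(String[] args) {
        double rate = throughputBytesPerSec(85 * MB, 15); // 85 MB in ~15 s
        double days = estimateDays(27 * TB, rate);        // 27 TB total
        System.out.printf("rate = %.2f MB/s, estimated time = %.1f days%n",
                rate / MB, days);
    }
}
```

With the figures quoted above this comes out at roughly 55 days rather than 2, so either the data volume or the estimate is worth double-checking.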
I understand that with better hardware, such as SSDs and more RAM, I will
get better results. However, I would like to optimize what I have now to the
fullest before scaling up. What other settings can I tweak to improve my
current test?
.put("client.transport.sniff", true)
.put("refresh_interval", "-1")
.put("number_of_shards", 1)
.put("number_of_replicas", "0")
On Monday, February 3, 2014 2:02:32 PM UTC-5, Jörg Prante wrote:
Not sure if I understand.
If I had to index a pile of documents, say 15M, I would build bulk requests
of 1000 documents each, where each doc is ~1 KB on average, so I end up at
~1 MB per request. I would not care about differing doc sizes, as they even
out over the total amount. Then I send this bulk request over the wire. With
a threaded bulk feeder, I can control the number of concurrent bulk requests,
up to the number of CPU cores, say 32 cores. Then repeat. In total, I send
15K bulk requests.
The effect is that on the ES cluster, each bulk request of 1 MB size
allocates only a few resources on the heap and can be processed fast. If the
cluster is slow, the client sees the ongoing bulk requests piling up before
bulk responses are returned, and can control bulk capacity against a maximum
concurrency limit. If the cluster is fast, the client receives responses
almost instantly, and the client can decide whether it is more appropriate
to increase bulk request size or concurrency.
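The feeder idea described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the Elasticsearch client API: `BoundedBulkFeeder`, `feed`, and the `sender` callback are hypothetical names, and `sender` stands in for whatever actually ships a bulk request and waits for its response. A semaphore caps the number of outstanding requests, so a slow cluster blocks the feeder instead of letting requests pile up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class BoundedBulkFeeder {
    final AtomicInteger inFlight = new AtomicInteger();     // requests currently outstanding
    final AtomicInteger maxInFlight = new AtomicInteger();  // highest concurrency observed
    final AtomicInteger requestsSent = new AtomicInteger(); // completed bulk requests

    /** Splits docs into batches and keeps at most maxConcurrent batches in flight. */
    void feed(List<String> docs, int batchSize, int maxConcurrent,
              Consumer<List<String>> sender) {
        Semaphore permits = new Semaphore(maxConcurrent);
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                List<String> batch = new ArrayList<>(docs.subList(start, end));
                permits.acquire(); // blocks while maxConcurrent requests are outstanding
                pool.execute(() -> {
                    int now = inFlight.incrementAndGet();
                    maxInFlight.accumulateAndGet(now, Math::max);
                    try {
                        sender.accept(batch); // hypothetical: send bulk, wait for response
                        requestsSent.incrementAndGet();
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release();
                    }
                });
            }
            // Wait for the remaining requests by reclaiming every permit.
            permits.acquire(maxConcurrent);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 15_000; i++) docs.add("doc-" + i);
        BoundedBulkFeeder feeder = new BoundedBulkFeeder();
        feeder.feed(docs, 1000, 4, batch -> { /* pretend to send the batch */ });
        System.out.println("bulk requests sent: " + feeder.requestsSent.get());
    }
}
```

Raising `batchSize` or `maxConcurrent` in this sketch corresponds to the two knobs mentioned above: bulk request size and concurrency.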
Does it make sense?
Jörg
On Mon, Feb 3, 2014 at 5:06 PM, ZenMaster80 <sabda...@gmail.com> wrote:
Jörg,
Just so I understand this: if I were to index 100 MB worth of data in total
with chunk volumes of 5 MB each, I would have to index 20 times. If I were
to set the bulk size to 20 MB, I would have to index 5 times.
This is a small data size; picture that I have millions of documents. Are
you saying the first method is better because GC operations would be faster?
Thanks again
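The request counts in the question are a ceiling division, sketched below. The class and method names are hypothetical; the units only need to match between the two arguments (MB with MB, docs with docs).

```java
class BulkRequestCount {
    /** Number of bulk requests needed to send totalUnits in chunks of unitsPerRequest. */
    static long requestCount(long totalUnits, long unitsPerRequest) {
        if (unitsPerRequest <= 0) {
            throw new IllegalArgumentException("unitsPerRequest must be positive");
        }
        // Ceiling division: a partial last chunk still costs one request.
        return (totalUnits + unitsPerRequest - 1) / unitsPerRequest;
    }

    public static void main(String[] args) {
        System.out.println(requestCount(100, 5));            // 100 MB in 5 MB chunks
        System.out.println(requestCount(100, 20));           // 100 MB in 20 MB chunks
        System.out.println(requestCount(15_000_000, 1_000)); // 15M docs, 1000 per request
    }
}
```

Fewer, larger requests save round trips, but each request then holds more data on the heap while it is processed, which is the trade-off being discussed in this thread.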
On Monday, February 3, 2014 9:47:46 AM UTC-5, Jörg Prante wrote:
Note, bulk operates just on network transport level, not on index level
(there are no transactions or chunks). Bulk saves network roundtrips, while
the execution of index operations is essentially the same as if you
transferred the operations one by one.
To change refresh interval to -1, use an update settings request like
this:
reference/current/indices-update-settings.html
// Build the settings change: "-1" disables automatic refresh.
ImmutableSettings.Builder settingsBuilder = ImmutableSettings.settingsBuilder();
settingsBuilder.put("refresh_interval", "-1");
// Apply it to the existing index through the admin client.
UpdateSettingsRequest updateSettingsRequest = new UpdateSettingsRequest(myIndexName)
    .settings(settingsBuilder);
client.admin().indices()
    .updateSettings(updateSettingsRequest)
    .actionGet();
Jörg
--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/531710e5-e42a-4ed1-a1e0-ad5d48e14146%40googlegroups.com
.
For more options, visit https://groups.google.com/groups/opt_out.