Bulk index 70 million records

eunever32 · March 5, 2014, 8:20pm

Hi

I can index 70m small (1k) records in 40 minutes.

Would that performance be good/bad?

Configuration is 6 x Elasticsearch nodes each with 16GB dedicated memory.
Each node is an 8-processor Intel Linux server.

There are 6 clients, one running locally on each node (localhost), each
running elasticsearch-py helpers.bulk and in turn spawning 8 client
processes (48 processes total).
The index.store.type is memory
refresh_interval 120s
threadpool.bulk.queue_size is 200
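
For readers who want to try a similar client, here is a minimal sketch of one worker process. It assumes the elasticsearch-py `helpers.bulk` function; the index name `records`, the type name `record`, the host, the chunk size, and the helper names `make_actions`, `split_for_worker` and `index_records` are hypothetical placeholders, not the values or code used in this test.

```python
from itertools import islice


def make_actions(records, index="records", doc_type="record"):
    """Wrap each source dict in the action format that helpers.bulk consumes."""
    for rec in records:
        action = {"_index": index, "_source": rec}
        if doc_type is not None:
            # Mapping types belong to older Elasticsearch versions;
            # pass doc_type=None on versions that no longer accept them.
            action["_type"] = doc_type
        yield action


def split_for_worker(records, worker_id, n_workers):
    """Give worker `worker_id` every n_workers-th record (round-robin split)."""
    return islice(records, worker_id, None, n_workers)


def index_records(records, hosts=("localhost:9200",), chunk_size=1000):
    """Send the records with helpers.bulk; returns (success count, errors)."""
    # Imported lazily so the pure helpers above work without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(list(hosts))
    return helpers.bulk(es, make_actions(records), chunk_size=chunk_size)
```

With 8 processes per node, each process would call something like `index_records(split_for_worker(source, worker_id, 8))` on its own slice of the input.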

Marvel reports an index rate of up to 80,000 records per second.
But in practice, the net rate over the full 40 minutes is more
like 30,000 records/s.
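
As a quick check of that net figure, the arithmetic can be worked through as below. The numbers are the ones quoted in this post; the variable names are only illustrative.

```python
records = 70_000_000        # total records indexed
elapsed_s = 40 * 60         # 40 minutes in seconds
peak_rate = 80_000          # peak index rate reported by Marvel

net_rate = records / elapsed_s
print(round(net_rate))                 # 29167 records/s
print(round(net_rate / peak_rate, 2))  # 0.36, i.e. about a third of the peak
```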

Given the hardware my question is: is this good or should I expect faster?
And what can be done to increase throughput?

Throwing more clients at the server does seem to drive up performance...
but how to measure what is the bottleneck?

Should I be concerned about the IOPS that Marvel reports for each node on
the cluster summary?
1: 344
2: 466
3: 246
4: 261
5: 162
6: 93

Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/bce7bc57-a5ce-4224-bf28-4791cacf12de%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · March 6, 2014, 9:53am

There is not much room for improvement.

Note that with index.store.type memory, you index only into main memory. In
this case, add as much RAM as you can.

You might also improve throughput by:

- setting the refresh interval to -1 while bulk indexing (this disables refresh)
- setting the shard count to 6 (or 12, 18, 24, ...) to align with the node count
- setting replicas to 0 while bulk indexing
- increasing the index buffer ratio, indices.memory.index_buffer_size (default 10%)
- increasing the throttle rate in the index store module (default is 20mb)
- streamlining segment merging (though that has no noticeable effect with
  index.store.type memory)
- indexing from remote servers, using extra hardware outside the ES cluster
  to build the JSON docs
- locking the ES JVM process into RAM with mlockall to avoid paging
- tuning GC settings
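
The per-index items in that list (refresh interval, replicas, shard count) can be sketched as settings bodies. This is only an illustration: the function names are hypothetical, and the restore values of "1s" and 1 replica are the Elasticsearch defaults, assumed here because the thread does not say what the index should return to. The dicts could be passed to the create-index and update-settings APIs.

```python
def create_index_body(node_count, shards_per_node=1):
    """Index creation body with a shard count that is a multiple of the node count."""
    return {
        "settings": {
            "index": {
                "number_of_shards": node_count * shards_per_node,
                "number_of_replicas": 0,  # no replicas during the bulk load
            }
        }
    }


def bulk_load_settings():
    """Dynamic settings to apply before the bulk load starts."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval="1s", replicas=1):
    """Dynamic settings to apply once the bulk load has finished (assumed defaults)."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


print(create_index_body(6)["settings"]["index"]["number_of_shards"])  # 6
```

The shard count can only be chosen when the index is created, whereas the refresh interval and replica count can be changed on a live index, which is why they are kept in separate bodies.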

By using remote bulk clients, you can also check whether the clients
saturate your network bandwidth.
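
As a rough guide for that check, the payload bandwidth implied by the numbers in the original post can be estimated as follows. This is a back-of-the-envelope sketch: it assumes "1k" means 1024 bytes per record and ignores bulk action metadata, HTTP overhead, and traffic between nodes, so real usage would be higher. The function name is hypothetical.

```python
def payload_mbit_per_s(records_per_s, bytes_per_record=1024):
    """Raw document payload in megabits per second (1 Mbit = 1,000,000 bits)."""
    return records_per_s * bytes_per_record * 8 / 1_000_000


net = payload_mbit_per_s(30_000)   # net rate from the original post
peak = payload_mbit_per_s(80_000)  # peak rate reported by Marvel
print(round(net, 2))   # 245.76
print(round(peak, 2))  # 655.36
```

Comparing figures like these with the link speed of the client machines gives a first indication of whether the network could be the limit.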

Jörg

