Bulk index 70 million records


(eunever32) #1

Hi

I can index 70 million small (1k) records in 40 minutes.

Is that performance good or bad?

The configuration is 6 Elasticsearch nodes, each with 16GB of dedicated memory.
Each node is an 8-processor Intel Linux server.

There are 6 clients, one running locally on each node (localhost). Each client
runs elasticsearch-py helpers.bulk and spawns 8 client processes (48
processes total).
index.store.type is memory
refresh_interval is 120s
threadpool.bulk.queue_size is 200
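
For readers who want to picture the client side, here is a minimal sketch of one such worker, assuming elasticsearch-py from the 1.x era. The index name "records", the type "record", the chunk size and the round-robin split are all assumptions for illustration, not details from the setup above.

```python
from itertools import islice


def generate_actions(records, index_name, doc_type):
    """Yield one bulk action per record, in the dict form helpers.bulk accepts."""
    for record in records:
        yield {"_index": index_name, "_type": doc_type, "_source": record}


def split_round_robin(records, worker_count):
    """Split a list so that worker i gets records i, i + worker_count, ..."""
    return [list(islice(records, i, None, worker_count)) for i in range(worker_count)]


def index_slice(records, hosts, index_name="records", doc_type="record"):
    """Send one worker's slice to the cluster (needs elasticsearch-py and live nodes)."""
    # Imported lazily so the helpers above can be tried without the library.
    from elasticsearch import Elasticsearch, helpers

    client = Elasticsearch(hosts)
    # chunk_size=1000 is an assumed value; the thread does not state one.
    return helpers.bulk(
        client,
        generate_actions(records, index_name, doc_type),
        chunk_size=1000,
    )
```

Each slice returned by split_round_robin could then be handed to its own process, for example through multiprocessing.Pool, to get 8 workers per client.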

Marvel reports an index rate of up to 80,000 records per second.
But in practice, the net rate over the 40 minutes is closer to
30,000 records/s.
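
The net figure can be checked from the numbers already given (70 million records, 40 minutes, 6 nodes, 80,000 records/s peak). A small sketch of the arithmetic; the function name is only for illustration:

```python
TOTAL_RECORDS = 70_000_000
ELAPSED_MINUTES = 40
NODE_COUNT = 6
PEAK_RATE = 80_000  # peak index rate reported by Marvel, records/s


def sustained_rate(total_records, elapsed_minutes):
    """Average records per second over the whole run."""
    return total_records / (elapsed_minutes * 60)


rate = sustained_rate(TOTAL_RECORDS, ELAPSED_MINUTES)
per_node = rate / NODE_COUNT
peak_fraction = rate / PEAK_RATE

print("sustained: %.0f records/s" % rate)
print("per node:  %.0f records/s" % per_node)
print("sustained rate is %.0f%% of the reported peak" % (peak_fraction * 100))
```

So the sustained rate is a little under 30,000 records/s, or a bit under 5,000 records/s per node.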

Given the hardware, my question is: is this good, or should I expect faster?
And what can be done to increase throughput?

Throwing more clients at the server does seem to drive up performance,
but how do I measure what the bottleneck is?

Should I be concerned that the IOps reported by Marvel in the cluster
summary are:
1: 344
2: 466
3: 246
4: 261
5: 162
6: 93

Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/bce7bc57-a5ce-4224-bf28-4791cacf12de%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

There is not much room for improvement.

Note that with index.store.type set to memory, you index only into main
memory. In this case, add as much RAM as you can.

You might also improve throughput by

  • setting the refresh interval to -1 while bulk indexing (this disables refresh)
  • setting the shard count to 6 (or 12, 18, 24, ...) to align with the node count
  • setting replicas to 0 while bulk indexing
  • increasing the index buffer ratio indices.memory.index_buffer_size (default 10%)
  • increasing the throttle rate in the index store module (default is 20mb)
  • streamlining segment merging (though this has no noticeable effect with
    index.store.type memory)
  • indexing from remote servers, using extra hardware outside the ES cluster
    to build the JSON docs
  • locking the ES JVM process into RAM with mlockall to avoid paging
  • tuning GC settings
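
The first three items in the list can be sketched as plain settings bodies. This is only an illustration: the function names are made up, the restored replica count of 1 is an assumption (the thread does not say what it was), and 120s is the refresh interval mentioned in the question.

```python
def bulk_load_settings():
    """Index settings to apply before a bulk load: no refresh, no replicas."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval="120s", replicas=1):
    """Index settings to put back afterwards (replicas=1 is an assumed value)."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


def aligned_shard_counts(node_count, multiples=4):
    """Shard counts that spread evenly over the nodes: 1x, 2x, 3x, ... the node count."""
    return [node_count * m for m in range(1, multiples + 1)]


print(bulk_load_settings())
print(restore_settings())
print(aligned_shard_counts(6))
```

With elasticsearch-py, the two settings bodies could be sent with client.indices.put_settings(index=..., body=...). The shard count is different: number_of_shards has to be chosen when the index is created and cannot be changed through put_settings.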

With remote bulk clients, you can also check whether the clients saturate
your network bandwidth.
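
As a rough way to judge saturation, the raw payload rate can be estimated from the figures in the question. The sketch below assumes "1k" means 1024 bytes per record and a hypothetical 1 Gbit/s link; neither is stated in the thread, and bulk request overhead is ignored.

```python
RECORD_BYTES = 1024                   # assumption: "1k" record = 1024 bytes
LINK_BITS_PER_SECOND = 1_000_000_000  # hypothetical 1 Gbit/s network link


def payload_megabits_per_second(records_per_second, record_bytes=RECORD_BYTES):
    """Raw document payload in Mbit/s, without bulk or HTTP overhead."""
    return records_per_second * record_bytes * 8 / 1_000_000


def link_utilisation(records_per_second,
                     link_bits_per_second=LINK_BITS_PER_SECOND,
                     record_bytes=RECORD_BYTES):
    """Fraction of the link consumed by the raw payload."""
    return records_per_second * record_bytes * 8 / link_bits_per_second


for label, rate in (("sustained", 30_000), ("peak", 80_000)):
    print("%s: %.1f Mbit/s, %.0f%% of the link" % (
        label,
        payload_megabits_per_second(rate),
        link_utilisation(rate) * 100,
    ))
```

Under these assumptions the sustained rate would use roughly a quarter of such a link, and the peak rate roughly two thirds, if all traffic went through a single link.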

Jörg

On Wed, Mar 5, 2014 at 9:20 PM, eunever32@gmail.com wrote:



