Improve indexing throughput

kimchy · March 5, 2012, 8:31pm

Note that the merge factor parameter does not apply to the default tiered merge policy. In any case, setting it to 1 is not recommended, since you can always control the number of segments the index is optimized down to in the optimize API call.

On Monday, March 5, 2012 at 6:45 PM, Craig Brown wrote:

We're running on AWS on 4 C1-XL nodes, each with 7GB RAM and 20 compute units (8 virtual cores). We allocate 4GB RAM to ES. Each node has one 500GB EBS volume for storage. We run 26 shards with 0 replicas when indexing. It's MUCH faster to index with 0 replicas if you can, then up the replica number after indexing, than it is to index with 1 or more replicas. We set refresh_interval to 30s and merge.policy.merge_factor to 30. After indexing, we set them back to 1s and 1 and run optimize. This really helps.
Our documents are about 2k-5k in size and we index about 10k-12k docs/sec initially. After 240m docs, we're in the 5k-6k docs/sec range. We wrote our own multi-threaded indexing tool to do the work. We enable _source and compression on _source. We still have _all enabled though we are not using it. We'll disable that in the next round.

Craig
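The sequence Craig describes (drop replicas and slow the refresh before loading, restore both afterwards, then optimize) can be sketched as a list of HTTP requests. This is only an illustration under assumptions: the index name `docs`, the helper names, and the choice to return requests instead of sending them are all hypothetical, and `_optimize` is the endpoint from Elasticsearch versions of this era (later releases replaced it with `_forcemerge`). The merge factor is left out because, as noted at the top of the thread, it does not apply to the default tiered merge policy.

```python
import json


def settings_request(index, replicas, refresh_interval):
    """Build (method, path, body) for an index settings update."""
    body = {
        "index.number_of_replicas": replicas,
        "index.refresh_interval": refresh_interval,
    }
    return ("PUT", f"/{index}/_settings", json.dumps(body, sort_keys=True))


def optimize_request(index, max_num_segments):
    """Build (method, path, body) for an optimize call (old-style endpoint)."""
    return ("POST", f"/{index}/_optimize?max_num_segments={max_num_segments}", None)


def bulk_load_plan(index, replicas_after, max_num_segments):
    """Requests to issue around a large indexing run, in order."""
    return [
        # Before indexing: no replicas, infrequent refresh (values from the post).
        settings_request(index, 0, "30s"),
        # ... the indexing run itself happens here ...
        # After indexing: restore replicas and the 1s refresh, then optimize.
        settings_request(index, replicas_after, "1s"),
        optimize_request(index, max_num_segments),
    ]


if __name__ == "__main__":
    # "docs", 1 replica and 5 segments are placeholder values for illustration.
    for method, path, body in bulk_load_plan("docs", 1, 5):
        print(method, path, body or "")
```

Each tuple can be handed to whatever HTTP client is in use; keeping the plan as plain data makes it easy to review the settings before touching a live cluster.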

On Mon, Mar 5, 2012 at 9:11 AM, haarts <harmaarts@gmail.com> wrote:

Thanks a lot for the insight! I'd better convince my boss to buy 16-disk machines.

On Monday, 5 March 2012 17:03:38 UTC+1, Thomas Peuss wrote:

Hi!

On Monday, 5 March 2012 15:21:11 UTC+1, haarts wrote:

Those are some impressive numbers. Would you mind sharing what kind of machines you are running on? We are struggling to index 500M documents, reaching 1000+ inserts per second on a 3-node cluster (8-core i7, 24GB, 1 simple spinner). Indexing performance is acceptable, but first-time query performance isn't great (seconds...).

We are running an 8-node cluster in two datacenters (4 nodes per DC). Each machine has 24 cores, 32GB RAM and 8 disks (extendable to 16 disks) running RHEL 6.1. The machines are not dedicated to ES alone (we use 50% of the cores for number crunching without I/O involved). Currently we are running with 16 shards and 1 replica.

We are currently peaking at 400 docs/s but the numbers are rising...

You should try inserting with many threads in parallel (we use 16). It is important that each thread waits for the response from ES before sending its next request; otherwise you will overload ES.

CU
Thomas
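A minimal sketch of the pattern Thomas describes: a fixed pool of worker threads, each of which blocks on the response before taking its next document, so the number of requests in flight never exceeds the pool size. The function name `index_all` and the `send` callable are hypothetical; `send` stands in for whatever blocking client call performs one index request and returns once Elasticsearch has answered.

```python
from concurrent.futures import ThreadPoolExecutor


def index_all(docs, send, workers=16):
    """Index docs with at most `workers` requests in flight.

    `send` is a hypothetical callable that performs one blocking index
    request and returns only after the server has responded.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker thread calls send() and picks up its next document
        # only after that call has returned, which bounds the load on ES.
        return list(pool.map(send, docs))


if __name__ == "__main__":
    # Stand-in for a real client call; it just echoes the document id.
    def fake_send(doc):
        return doc["id"]

    results = index_all([{"id": i} for i in range(100)], fake_send)
    print(len(results), "documents acknowledged")
```

Results come back in input order, and an exception raised by any `send` call propagates when the results are collected, so a failed request is not silently dropped.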

