Improve indexing throughput

Lior_Cohen_barnea · March 4, 2012, 9:05am

Hi all,

I am doing a POC on ES and SolrCloud.

I want to improve ES indexing throughput, which is considerably lower than Solr's.
What approach should I take:
Increase the number of shards and/or nodes?
Configure my client with all the nodes' IPs?
Run 1 thread per node and configure 1 node IP per client?
Change the segment settings? If so, how?

Thanks in advance,
Lior.

Mark_Waddle · March 5, 2012, 6:57am

Interesting. My experience so far has been that ES has equivalent or faster
indexing than Solr. Are you doing bulk indexing or incremental indexing? If
you are doing bulk indexing you should read

.

Radu_Gheorghe1 · March 5, 2012, 8:36am

Hi,

I just had a "facepalm moment" while doing some performance tests on
cloud machines. Maybe it also applies to you:

the default memory settings for Elasticsearch are:

grep _MEM elasticsearch/bin/elasticsearch.in.sh

if [ "x$ES_MIN_MEM" = "x" ]; then
ES_MIN_MEM=256m
if [ "x$ES_MAX_MEM" = "x" ]; then
ES_MAX_MEM=1g
JAVA_OPTS="$JAVA_OPTS -Xms${ES_MIN_MEM}"
JAVA_OPTS="$JAVA_OPTS -Xmx${ES_MAX_MEM}"

This might turn out to be too little when inserting a huge amount of
logs for performance tests.

When I raised this limit (e.g. 4g min, 20g max), insert
performance roughly tripled.

Take a look here for some more info:

For performance, it's best to set the minimum and maximum memory to
the same value, so that the JVM doesn't have to resize the heap at
runtime, which keeps garbage-collection overhead down.
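
A minimal sketch of pinning both limits to one value before launching ES.
It only builds the environment; ES_MIN_MEM and ES_MAX_MEM are the variables
from elasticsearch.in.sh above, while the helper name es_heap_env and the
launcher path in the comment are assumptions for illustration.

```python
import os


def es_heap_env(heap="4g", base_env=None):
    """Return a copy of the environment with min and max heap set equal."""
    env = dict(os.environ if base_env is None else base_env)
    env["ES_MIN_MEM"] = heap  # becomes -Xms in elasticsearch.in.sh
    env["ES_MAX_MEM"] = heap  # becomes -Xmx in elasticsearch.in.sh
    return env


# Hypothetical usage (launcher path is an assumption):
# subprocess.Popen(["elasticsearch/bin/elasticsearch"], env=es_heap_env("4g"))
```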

otisg · March 5, 2012, 11:58am

Hello Lior,

Well, Elasticsearch has real-time replication, so ES has to do more work.
Also, ES has index refreshing happening every 1 sec by default, while Solr
doesn't (or are you comparing ES and SolrCloud?).
...
I think more shards would speed things up, as would using the bulk API and
setting the number of replicas to 0.
You could also increase the number of concurrent indexers in order to
maximize CPU and/or disk I/O usage.
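
For reference, a rough sketch of what a bulk request body looks like: one
action line followed by one source line per document, newline-delimited,
with a trailing newline. The index name, type name and documents below are
made up, and the _type field reflects ES versions of this era (later
versions dropped types).

```python
import json


def bulk_body(index, doc_type, docs):
    """Build a newline-delimited bulk body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


body = bulk_body("logs", "event", [("1", {"msg": "a"}), ("2", {"msg": "b"})])
```

The body would be sent in a single POST to the _bulk endpoint instead of
issuing one index request per document.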

Otis

Thomas_Peuss · March 5, 2012, 2:00pm

Hi Lior!

On Sunday, March 4, 2012 10:05:58 UTC+1, Lior Cohen barnea wrote:

I want to improve ES indexing throughput, which is considerably lower than Solr's.
What approach should I take:
Increase the number of shards and/or nodes?
Configure my client with all the nodes' IPs?
Run 1 thread per node and configure 1 node IP per client?
Change the segment settings? If so, how?

First of all, make sure that you don't compare apples with oranges. When you
calculate the indexing speed of Solr, you need to take the commit time into
account, which you don't need to do with ES. You will then see that Solr gets
"slower".

Increasing the shard count is only useful if you have enough CPU cores and
disk spindles available. You should increase the refresh_interval, or
disable it altogether and trigger refreshes manually, if you get to really
big index sizes (we have >1 billion docs in some indices while having >500
indices). If you run on many nodes, your network can be the next
bottleneck.

Bulk indexing gives you a big performance gain over doing single updates, so
you should consider that.

But in the end you need to benchmark with your amount of data (docs per
second, index size etc.) and different settings. If you need to guarantee
some sort of SLA, there is no way around it.
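
A minimal sketch of such a benchmark loop, assuming a caller-supplied
index_batch function (hypothetical) that sends one batch, waits for the
response and returns the number of docs indexed. Running it with different
batch sizes and settings gives comparable docs-per-second figures.

```python
import time


def measure_throughput(docs, index_batch, batch_size):
    """Index docs in batches and return (docs_indexed, docs_per_second)."""
    start = time.perf_counter()
    indexed = 0
    for i in range(0, len(docs), batch_size):
        indexed += index_batch(docs[i:i + batch_size])
    elapsed = time.perf_counter() - start
    rate = indexed / elapsed if elapsed > 0 else float("inf")
    return indexed, rate


# Stand-in indexer for illustration: pretends every doc was accepted.
count, rate = measure_throughput(list(range(250)), len, batch_size=100)
```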

CU
Thomas

haarts · March 5, 2012, 2:21pm

Hi Thomas,

Those are some impressive numbers. Would you mind sharing what kind of
machines you are running on? We are struggling to index 500M documents,
reaching 1000+ inserts per second on a 3-node cluster (8-core i7, 24GB, 1
simple spinner). Indexing performance is acceptable, but first-time query
performance isn't great (seconds...).

jprante · March 5, 2012, 2:54pm

Hi,

Here are some techniques I used for Elasticsearch indexing:

- a custom multithreaded remote index client (12 instances in
  parallel, depends on your input data streams)
- remote index client for distributing feed load and ES node load. The ES
  TransportClient is thread-safe; a single instance can be re-used for
  parallel indexing.
- sharding (12 shards / primaries plus 1x replica = 24 Lucene indexes
  over 3 nodes in total)
- bulk indexing, balanced by waiting for callbacks, limited to max.
  30 jobs with 100 docs each; each doc is ~2-200 KB, 4 KB on average
- compressed input (reduces I/O bandwidth with large text documents)
- RHEL, ES data volume mounted with the noatime option, SAS-2 (6 Gbit/s)
  drives, RAID 5
- default ES config: no change of refresh_interval, merge
  policies etc. = no ES tuning at all

This config can sustain ~3000 docs per second for ~2 hours of indexing into
Elasticsearch (for a total of ~20 million documents). About every 20 minutes,
each node sees a load spike, because Lucene merges are I/O-heavy.
This could be optimized, but for my purposes it suffices right now.
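
The "limited to max. 30 jobs with 100 docs each" idea can be sketched
roughly like this. It is not the actual client described above; send_bulk
is a hypothetical callable that sends one batch and returns how many docs
were indexed. A semaphore keeps the number of outstanding bulk jobs bounded,
and the completion callback frees a slot.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def index_bounded(docs, send_bulk, max_jobs=30, batch_size=100, workers=12):
    """Send docs in batches, never allowing more than max_jobs in flight."""
    slots = threading.BoundedSemaphore(max_jobs)
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(0, len(docs), batch_size):
            slots.acquire()  # blocks while max_jobs batches are outstanding
            future = pool.submit(send_bulk, docs[i:i + batch_size])
            future.add_done_callback(lambda _f: slots.release())
            futures.append(future)
    # Leaving the with-block waits for all jobs; result() re-raises failures.
    return sum(f.result() for f in futures)
```

Without the semaphore the producer would queue batches as fast as it can
read input, which is the overload the callback-based balancing avoids.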

I am afraid comparing an ES cluster setup to a single Solr index seems
quite unfair.

Jörg

kimchy · March 5, 2012, 3:02pm

A few more things to add to this great thread:

- By default, Elasticsearch also indexes all the JSON data into an _all field. This is very convenient, but does add to the indexing time (and index size). You can easily disable it in the _all mapping, but it is a difference compared to Solr, which doesn't do it.
- It was already mentioned, but the indexing process is different compared to Solr (I am not sure which SolrCloud you are using, but it has changed considerably over time). ES does active replication, and there is no need to commit (ES commits periodically by itself, and has a transaction log). See this video for a bit more info: Elasticsearch Platform — Find real-time answers at scale | Elastic.
- By default, ES stores the whole JSON so you can easily get it back if you issue a GET or a search. If you don't store anything in your Solr mappings, ES incurs that additional overhead when you compare the two.
- When indexing data, if your client ends up doing batch indexing with Solr, make sure to do batch indexing with ES as well.
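
As a small illustration of the first point, this is roughly the mapping
body that turns off the _all field for a type. The type name "event" is made
up, and this applies to ES versions of this era (the _all field was removed
in later major versions).

```python
import json


def mapping_without_all(doc_type):
    """Return a type mapping that disables the _all field."""
    return {doc_type: {"_all": {"enabled": False}}}


mapping_json = json.dumps(mapping_without_all("event"))
print(mapping_json)  # {"event": {"_all": {"enabled": false}}}
```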

Thomas_Peuss · March 5, 2012, 4:03pm

Hi!

On Monday, March 5, 2012 15:21:11 UTC+1, haarts wrote:

Those are some impressive numbers. Would you mind sharing what kind of
machines you are running on? We are struggling to index 500M documents,
reaching 1000+ inserts per second on a 3-node cluster (8-core i7, 24GB, 1
simple spinner). Indexing performance is acceptable, but first-time query
performance isn't great (seconds...).

We are running an 8-node cluster in two datacenters (4 nodes per DC). Each
machine has 24 cores, 32GB RAM and 8 disks (extendable to 16 disks) running
RHEL 6.1. The machines are not dedicated to ES alone (we use 50% of the
cores for number crunching without I/O involved). Currently we are running
with 16 shards and 1 replica.

We are currently peaking at 400 docs/s but the numbers are rising...

You should try to insert with many threads in parallel (we use 16).
It is important that you wait for the response from ES, because otherwise
you will overload ES.
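
A rough sketch of that pattern, not our actual code: a fixed number of
worker threads pull batches from a queue, and each worker waits for the
response before taking the next batch. The send callable is hypothetical
and stands in for a blocking index or bulk request.

```python
import queue
import threading


def run_workers(batches, send, num_threads=16):
    """Index batches with num_threads workers; each waits for its response."""
    work = queue.Queue()
    for batch in batches:
        work.put(batch)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                batch = work.get_nowait()
            except queue.Empty:
                return  # nothing left to index
            response = send(batch)  # blocks until the request is answered
            with lock:
                results.append(response)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each thread has at most one request outstanding, the load on the
cluster is capped at num_threads concurrent requests.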

CU
Thomas

haarts · March 5, 2012, 4:11pm

Thanks a lot for the insight! I'd better convince my boss to buy 16-disk
machines.

Craig_Brown · March 5, 2012, 4:45pm

We're running on AWS, 4 C1-XL nodes - 7GB RAM, 20 compute units (8 virtual
cores). We allocate 4GB RAM to ES. Each node has one 500GB EBS volume for
storage. We run 26 shards with 0 replicas when indexing. It's MUCH faster
to index with 0 replicas if you can, then up the replica number after
indexing, than it is to index with 1 or more replicas. We set
refresh_interval to 30s and merge.policy.merge_factor to 30. After
indexing, we set them back to 1s and 1 respectively and run optimize. This
really helps.
Our documents are about 2-5 KB in size and we index about 10k-12k docs/sec
initially. After 240M docs, we're in the 5k-6k docs/sec range. We wrote our
own multi-threaded indexing tool to do the work. We enable _source and
compression on _source. We still have _all enabled though we are not using
it. We'll disable that in the next round.
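
The before/after settings switch can be sketched like this. It is an
illustration rather than our tool: put_settings is a hypothetical callable
that sends the JSON body to the index settings endpoint, and restoring to
1 replica is an assumption (use whatever replica count you want afterwards).

```python
import json
from contextlib import contextmanager

# Settings while bulk loading: no replicas, infrequent refresh.
BULK_LOAD = {"index": {"number_of_replicas": 0, "refresh_interval": "30s"}}
# Settings to restore afterwards (replica count of 1 is an assumption).
NORMAL = {"index": {"number_of_replicas": 1, "refresh_interval": "1s"}}


@contextmanager
def bulk_load_settings(put_settings, index):
    """Apply bulk-load settings, then restore normal ones even on failure."""
    put_settings(index, json.dumps(BULK_LOAD))
    try:
        yield
    finally:
        put_settings(index, json.dumps(NORMAL))
```

Usage would be "with bulk_load_settings(put_settings, 'myindex'):" wrapped
around the indexing run, followed by the optimize call.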

Craig

--
…
CRAIG BROWN
chief architect
youwho, Inc.

www.youwho.com http://www.youwho.com/

T: 801.855. 0921
M: 801.913. 0939

kimchy · March 5, 2012, 8:31pm

Note that the merge factor parameter does not apply to the default tiered merge policy. In any case, setting it to 1 is not recommended, since you can always control the number of segments it will optimize down to in the optimize API call.
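
As a small illustration, the segment count is passed to optimize as the
max_num_segments parameter. The sketch below only builds the request path;
the index name and the value 5 are made up, and the endpoint name reflects
ES versions of this era.

```python
from urllib.parse import urlencode


def optimize_path(index, max_num_segments):
    """Build the optimize request path with a target segment count."""
    query = urlencode({"max_num_segments": max_num_segments})
    return "/%s/_optimize?%s" % (index, query)


path = optimize_path("docs", 5)
print(path)  # /docs/_optimize?max_num_segments=5
```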

Lior_Cohen_barnea · March 6, 2012, 7:52am

Thanks for all the answers.

It seems that when I used a client node

import static org.elasticsearch.node.NodeBuilder.*;

// Join the cluster as a client node (holds no data)
Node node = nodeBuilder().client(true).node();
Client client = node.client();

indexing was much faster than when I used a transport client connected to
a local node.

Is that expected?

Craig_Brown · March 6, 2012, 4:59pm

OK, thanks for the advice.

Craig

Craig_Brown · March 6, 2012, 5:02pm

When you connect to the cluster this way, your client becomes a node in the
cluster and has access to all of the routing and other information. The
client is therefore much more efficient because it can directly communicate
with the node that has to do the work.
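
A toy model of the difference, purely for illustration. The hash below is
a stand-in (Elasticsearch uses its own routing hash), and the node names
and shard layout are made up. A client that knows the routing table sends
each doc straight to the node holding its shard; a client without it sends
to some entry node, which may have to forward the request.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard for a doc id (stand-in hash, not the real ES one)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def hops(doc_id, num_shards, shard_to_node, entry_node=None):
    """Network hops to reach the node owning the doc's shard.

    entry_node=None models a node client that knows the routing table.
    """
    target = shard_to_node[shard_for(doc_id, num_shards)]
    if entry_node is None:
        return 1  # goes directly to the right node
    return 1 if entry_node == target else 2  # may need a forward


layout = {0: "node-a", 1: "node-b"}
direct = hops("doc-1", 2, layout)
```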

Craig
