Improve indexing throughput


(Lior Cohen barnea) #1

Hi all,

I am doing a POC on ES and Solr cloud.

I want to improve indexing throughput which is quite less than Solr's.
What approach should i take:
Increase number of shards and/or nodes?
Config my client with all nodes IP's?
Run 1 thread per node and configure 1 node IP per client?
Changes the segments settings ? if so, how?

Thanks in advance,
Lior.


(Mark Waddle) #2

Interesting. My experience so far has been that ES has equivalent or faster
indexing than Solr. Are you doing bulk indexing or incremental indexing? If
you are doing bulk indexing you should read
http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html
.

On Sunday, March 4, 2012 1:05:58 AM UTC-8, Lior Cohen barnea wrote:

Hi all,

I am doing a POC on ES and Solr cloud.

I want to improve indexing throughput which is quite less than Solr's.
What approach should i take:
Increase number of shards and/or nodes?
Config my client with all nodes IP's?
Run 1 thread per node and configure 1 node IP per client?
Changes the segments settings ? if so, how?

Thanks in advance,
Lior.


(Radu Gheorghe) #3

Hi,

I just had a "facepalm moment" while doing some performance tests on
cloud machines. Maybe it also applies to you:

the default memory settings for Elasticsearch are:

grep _MEM elasticsearch/bin/elasticsearch.in.sh

if [ "x$ES_MIN_MEM" = "x" ]; then
ES_MIN_MEM=256m
if [ "x$ES_MAX_MEM" = "x" ]; then
ES_MAX_MEM=1g
JAVA_OPTS="$JAVA_OPTS -Xms${ES_MIN_MEM}"
JAVA_OPTS="$JAVA_OPTS -Xmx${ES_MAX_MEM}"

This might turn out to be too little when inserting a huge amount of
logs for performance tests.

When I raised this limit (eg: 4g min, 20g max), the inserting
performance suddenly multiplied by 3.

Take a look here for some more info:
http://www.elasticsearch.org/guide/reference/setup/installation.html

For performance, it's best to set the minimum and maximum memory to
the same value, in order to avoid using the Garbage Collector as much
as possible.

On Mar 4, 11:05 am, Lior Cohen barnea liorcohenbar...@gmail.com
wrote:

Hi all,

I am doing a POC on ES and Solr cloud.

I want to improve indexing throughput which is quite less than Solr's.
What approach should i take:
Increase number of shards and/or nodes?
Config my client with all nodes IP's?
Run 1 thread per node and configure 1 node IP per client?
Changes the segments settings ? if so, how?

Thanks in advance,
Lior.


(Otis Gospodnetić) #4

Hello Lior,

Well, ElasticSearch has real-time replication, so ES has to do more work.
Also, ES has index refreshing happening every 1 sec by default, while Solr
doesn't (or are you comparing ES and SolrCloud?).
...
I think more shards would speed things up, as would bulk API and setting
the number of replicas to 0.
You could also increase the number of concurrent indexers in order to
maximize CPU and/or disk IO.

Otis

Sematext is hiring ElasticSearch engineers World-Wide --

On Sunday, March 4, 2012 5:05:58 PM UTC+8, Lior Cohen barnea wrote:

Hi all,

I am doing a POC on ES and Solr cloud.

I want to improve indexing throughput which is quite less than Solr's.
What approach should i take:
Increase number of shards and/or nodes?
Config my client with all nodes IP's?
Run 1 thread per node and configure 1 node IP per client?
Changes the segments settings ? if so, how?

Thanks in advance,
Lior.

On Sunday, March 4, 2012 5:05:58 PM UTC+8, Lior Cohen barnea wrote:

Hi all,

I am doing a POC on ES and Solr cloud.

I want to improve indexing throughput which is quite less than Solr's.
What approach should i take:
Increase number of shards and/or nodes?
Config my client with all nodes IP's?
Run 1 thread per node and configure 1 node IP per client?
Changes the segments settings ? if so, how?

Thanks in advance,
Lior.


(Thomas Peuss) #5

Hi Lior!

Am Sonntag, 4. März 2012 10:05:58 UTC+1 schrieb Lior Cohen barnea:

I want to improve indexing throughput which is quite less than Solr's.
What approach should i take:
Increase number of shards and/or nodes?
Config my client with all nodes IP's?
Run 1 thread per node and configure 1 node IP per client?
Changes the segments settings ? if so, how?

First of all make sure that you don't compare apples with oranges. When you
calculate the indexing speed of Solr you need to take the "commit"-time
into account you don't need with ES. And you will see that Solr gets
"slower".

Increasing the shard count is only useful if you have enough CPU cores and
disk spindles available. You should increase the refresh_interval or
disable it alltogether and trigger it manually if you get to really big
index sizes (we have >1 billion docs in some indices while having >500
indices). If you run on many nodes your network can be the next

Bulk indexing gives you a big performance gain over doing single updates so
you should consider that.

But in the end you need to benchmark with your amount of data (docs per
second, index size etc.) and different settings. If need to guarantee some
sort of SLA there is no other way around it.

CU
Thomas


(haarts) #6

Hi Thomas,

Those are some impressive numbers. Would you mind sharing on what kind of
machines you are running? We are struggling indexing 500M documents,
reaching 1000+ inserts per second on a 3 node cluster (8 core i7 24GB, 1
simple spinner). Performance indexing is acceptable. But first time query
performance isn't great (seconds...).

On Monday, 5 March 2012 15:00:42 UTC+1, Thomas Peuss wrote:

Hi Lior!

Am Sonntag, 4. März 2012 10:05:58 UTC+1 schrieb Lior Cohen barnea:

I want to improve indexing throughput which is quite less than Solr's.
What approach should i take:
Increase number of shards and/or nodes?
Config my client with all nodes IP's?
Run 1 thread per node and configure 1 node IP per client?
Changes the segments settings ? if so, how?

First of all make sure that you don't compare apples with oranges. When
you calculate the indexing speed of Solr you need to take the "commit"-time
into account you don't need with ES. And you will see that Solr gets
"slower".

Increasing the shard count is only useful if you have enough CPU cores and
disk spindles available. You should increase the refresh_interval or
disable it alltogether and trigger it manually if you get to really big
index sizes (we have >1 billion docs in some indices while having >500
indices). If you run on many nodes your network can be the next

Bulk indexing gives you a big performance gain over doing single updates
so you should consider that.

But in the end you need to benchmark with your amount of data (docs per
second, index size etc.) and different settings. If need to guarantee some
sort of SLA there is no other way around it.

CU
Thomas


(Jörg Prante) #7

Hi,

here are some techniques I used for Elasticsearch indexing:

  • parallel multithreading custom remote index client (12 instances in
    parallel, depends on your input data streams)
  • remote index client for distributing feed load and ES node load. ES
    TransportClient is threadsafe, a single instance can be re-used for
    parallel indexing.
  • sharding (12 shards / primaries plus 1x replica = 24 Lucene indexes
    over 3 nodes in total)
  • bulk indexing, balanced by waiting for call backs, limited to max.
    30 jobs with 100 docs each, each doc is ~2-200 KB, 4KB average
  • compressed input (reduces I/O bandwidth with large text documents)
  • RHEL, ES data volume mounted with noatime option, SAS-2 (6Gbit/s)
    drives, RAID 5
  • default ES config changes, no change of refresh_interval, merge
    policies etc. = no ES tuning at all

This config can index constantly ~3000 docs per second for ~2 hours to
Elasticsearch (for a total of ~20 mio. documents). Each ~20 min on
each node, load spikes happen, because Lucene merges take heavy I/O.
This could be optimized but for my purposes it suffices right now.

I am afraid comparing an ES cluster setup to a single Solr index seems
quite unfair.

Jörg

On Mar 5, 12:58 pm, Otis Gospodnetic otis.gospodne...@gmail.com
wrote:

Hello Lior,

Well, ElasticSearch has real-time replication, so ES has to do more work.
Also, ES has index refreshing happening every 1 sec by default, while Solr
doesn't (or are you comparing ES and SolrCloud?).
...
I think more shards would speed things up, as would bulk API and setting
the number of replicas to 0.
You could also increase the number of concurrent indexers in order to
maximize CPU and/or disk IO.

Otis

Sematext is hiring ElasticSearch engineers World-Wide --http://sematext.com/about/jobs.html

On Sunday, March 4, 2012 5:05:58 PM UTC+8, Lior Cohen barnea wrote:

Hi all,

I am doing a POC on ES and Solr cloud.

I want to improve indexing throughput which is quite less than Solr's.
What approach should i take:
Increase number of shards and/or nodes?
Config my client with all nodes IP's?
Run 1 thread per node and configure 1 node IP per client?
Changes the segments settings ? if so, how?

Thanks in advance,
Lior.
On Sunday, March 4, 2012 5:05:58 PM UTC+8, Lior Cohen barnea wrote:

Hi all,

I am doing a POC on ES and Solr cloud.

I want to improve indexing throughput which is quite less than Solr's.
What approach should i take:
Increase number of shards and/or nodes?
Config my client with all nodes IP's?
Run 1 thread per node and configure 1 node IP per client?
Changes the segments settings ? if so, how?

Thanks in advance,
Lior.


(Shay Banon) #8

A few more things to add to this great thread:

  1. By default, elasticsearch also indexed all the json data into an _all field. This is very convenient, but does add to the indexing time (and index size). You can easily disable it with the all mapping, but its different compared to Solr, which doesn't do it.
  2. It was already mentioned, but indexing process is different compared to Solr (I am not sure which SolrCloud you are using, but it has changed considerable over time). It does active replication, and there is no need to commit (ES commits periodically by itself, and has a transaction log). See this video for a bit more info: http://www.elasticsearch.org/videos/2011/08/09/road-to-a-distributed-searchengine-berlinbuzzwords.html.
  3. By default, ES stores the whole json so you can easily get it back if you issue a GET or a search. If you don't store anything in your solr mappings, then it will incur that additional overhead comparing the two.
  4. When indexing data, if your client ends up doing batch indexing with Solr, make sure to do batch indexing with ES as well.

On Monday, March 5, 2012 at 4:21 PM, haarts wrote:

Hi Thomas,

Those are some impressive numbers. Would you mind sharing on what kind of machines you are running? We are struggling indexing 500M documents, reaching 1000+ inserts per second on a 3 node cluster (8 core i7 24GB, 1 simple spinner). Performance indexing is acceptable. But first time query performance isn't great (seconds...).

On Monday, 5 March 2012 15:00:42 UTC+1, Thomas Peuss wrote:

Hi Lior!

Am Sonntag, 4. März 2012 10:05:58 UTC+1 schrieb Lior Cohen barnea:

I want to improve indexing throughput which is quite less than Solr's.
What approach should i take:
Increase number of shards and/or nodes?
Config my client with all nodes IP's?
Run 1 thread per node and configure 1 node IP per client?
Changes the segments settings ? if so, how?

First of all make sure that you don't compare apples with oranges. When you calculate the indexing speed of Solr you need to take the "commit"-time into account you don't need with ES. And you will see that Solr gets "slower".

Increasing the shard count is only useful if you have enough CPU cores and disk spindles available. You should increase the refresh_interval or disable it alltogether and trigger it manually if you get to really big index sizes (we have >1 billion docs in some indices while having >500 indices). If you run on many nodes your network can be the next

Bulk indexing gives you a big performance gain over doing single updates so you should consider that.

But in the end you need to benchmark with your amount of data (docs per second, index size etc.) and different settings. If need to guarantee some sort of SLA there is no other way around it.

CU
Thomas


(Thomas Peuss) #9

Hi!

Am Montag, 5. März 2012 15:21:11 UTC+1 schrieb haarts:

Those are some impressive numbers. Would you mind sharing on what kind of
machines you are running? We are struggling indexing 500M documents,
reaching 1000+ inserts per second on a 3 node cluster (8 core i7 24GB, 1
simple spinner). Performance indexing is acceptable. But first time query
performance isn't great (seconds...).

We are running a 8-node cluster in two datacenters (4 nodes per DC). Each
machine has 24 cores, 32GB RAM and 8 disks (extendable to 16 disks) running
RHEL 6.1. The machines are not dedicated to ES alone (we use 50% of the
cores for number crunching without I/O involved). Currently we are running
with 16 shards and 1 replica.

We are currently peaking at 400 docs/s but the numbers are rising... :wink:

You should try to insert with many threads in parallel (we use 16).
Important here is that you wait for the response from ES because otherwise
you will overload ES.

CU
Thomas


(haarts) #10

Thanks a lot for the insight! I'd better convince by boss to buy 16 disk
machines. :slight_smile:

On Monday, 5 March 2012 17:03:38 UTC+1, Thomas Peuss wrote:

Hi!

Am Montag, 5. März 2012 15:21:11 UTC+1 schrieb haarts:

Those are some impressive numbers. Would you mind sharing on what kind of
machines you are running? We are struggling indexing 500M documents,
reaching 1000+ inserts per second on a 3 node cluster (8 core i7 24GB, 1
simple spinner). Performance indexing is acceptable. But first time query
performance isn't great (seconds...).

We are running a 8-node cluster in two datacenters (4 nodes per DC). Each
machine has 24 cores, 32GB RAM and 8 disks (extendable to 16 disks) running
RHEL 6.1. The machines are not dedicated to ES alone (we use 50% of the
cores for number crunching without I/O involved). Currently we are running
with 16 shards and 1 replica.

We are currently peaking at 400 docs/s but the numbers are rising... :wink:

You should try to insert with many threads in parallel (we use 16).
Important here is that you wait for the response from ES because otherwise
you will overload ES.

CU
Thomas


(Craig Brown) #11

We're running on AWS, 4 C1-XL nodes - 7GB ram, 20 compute units (8 virtual
cores). We allocate 4GB ram to ES. Each node as 1-500GB EBS instance for
storage. We run 26 shards with 0 replicas when indexing. It's MUCH faster
to index with 0 replicas if you can, then up the replica number after
indexing, than it is to index with 1 or more replicas. We set
refresh_interval to 30s and merge.policy.merge_factor to 30. After
indexing, we set them back to 1s and 1 and run optimize. This really helps.
Our documents are about 2k-5k in size and we index about 10k-12k docs/sec
initially. After 240m docs, we're in the 5k-6k docs/sec range. We wrote our
own multi-threaded indexing tool to do the work. We enable _source and
compression on _source. We still have _all enabled though we are not using
it. We'll disable that in the next round.

  • Craig

On Mon, Mar 5, 2012 at 9:11 AM, haarts harmaarts@gmail.com wrote:

Thanks a lot for the insight! I'd better convince by boss to buy 16 disk
machines. :slight_smile:

On Monday, 5 March 2012 17:03:38 UTC+1, Thomas Peuss wrote:

Hi!

Am Montag, 5. März 2012 15:21:11 UTC+1 schrieb haarts:

Those are some impressive numbers. Would you mind sharing on what kind
of machines you are running? We are struggling indexing 500M documents,
reaching 1000+ inserts per second on a 3 node cluster (8 core i7 24GB, 1
simple spinner). Performance indexing is acceptable. But first time query
performance isn't great (seconds...).

We are running a 8-node cluster in two datacenters (4 nodes per DC). Each
machine has 24 cores, 32GB RAM and 8 disks (extendable to 16 disks) running
RHEL 6.1. The machines are not dedicated to ES alone (we use 50% of the
cores for number crunching without I/O involved). Currently we are running
with 16 shards and 1 replica.

We are currently peaking at 400 docs/s but the numbers are rising... :wink:

You should try to insert with many threads in parallel (we use 16).
Important here is that you wait for the response from ES because otherwise
you will overload ES.

CU
Thomas

--

CRAIG BROWN
chief architect
youwho, Inc.

www.youwho.com http://www.youwho.com/

T: 801.855. 0921
M: 801.913. 0939


(Shay Banon) #12

Note that the merge factor parameter does not apply to the default tiered merge policy. In any case, setting it to 1 is not recommended, since you can always control the number of shards it will optimize down to in the optimize call API.

On Monday, March 5, 2012 at 6:45 PM, Craig Brown wrote:

We're running on AWS, 4 C1-XL nodes - 7GB ram, 20 compute units (8 virtual cores). We allocate 4GB ram to ES. Each node as 1-500GB EBS instance for storage. We run 26 shards with 0 replicas when indexing. It's MUCH faster to index with 0 replicas if you can, then up the replica number after indexing, than it is to index with 1 or more replicas. We set refresh_interval to 30s and merge.policy.merge_factor to 30. After indexing, we set them back to 1s and 1 and run optimize. This really helps.
Our documents are about 2k-5k in size and we index about 10k-12k docs/sec initially. After 240m docs, we're in the 5k-6k docs/sec range. We wrote our own multi-threaded indexing tool to do the work. We enable _source and compression on _source. We still have _all enabled though we are not using it. We'll disable that in the next round.

  • Craig

On Mon, Mar 5, 2012 at 9:11 AM, haarts <harmaarts@gmail.com (mailto:harmaarts@gmail.com)> wrote:

Thanks a lot for the insight! I'd better convince by boss to buy 16 disk machines. :slight_smile:

On Monday, 5 March 2012 17:03:38 UTC+1, Thomas Peuss wrote:

Hi!

Am Montag, 5. März 2012 15:21:11 UTC+1 schrieb haarts:

Those are some impressive numbers. Would you mind sharing on what kind of machines you are running? We are struggling indexing 500M documents, reaching 1000+ inserts per second on a 3 node cluster (8 core i7 24GB, 1 simple spinner). Performance indexing is acceptable. But first time query performance isn't great (seconds...).

We are running a 8-node cluster in two datacenters (4 nodes per DC). Each machine has 24 cores, 32GB RAM and 8 disks (extendable to 16 disks) running RHEL 6.1. The machines are not dedicated to ES alone (we use 50% of the cores for number crunching without I/O involved). Currently we are running with 16 shards and 1 replica.

We are currently peaking at 400 docs/s but the numbers are rising... :wink:

You should try to insert with many threads in parallel (we use 16). Important here is that you wait for the response from ES because otherwise you will overload ES.

CU
Thomas

--

CRAIG BROWN
chief architect
youwho, Inc.

www.youwho.com (http://www.youwho.com/)

T: 801.855. 0921
M: 801.913. 0939


(Lior Cohen barnea) #13

Thanks for all the answers.

It seem that when i used a client node

Node node = nodeBuilder().client(true).node(); Client client = node.client()

my indexing time was much faster than when i used transport client to
a local node ...

should it be like that ?

On Mar 5, 10:31 pm, Shay Banon kim...@gmail.com wrote:

Note that the merge factor parameter does not apply to the default tiered merge policy. In any case, setting it to 1 is not recommended, since you can always control the number of shards it will optimize down to in the optimize call API.

On Monday, March 5, 2012 at 6:45 PM, Craig Brown wrote:

We're running on AWS, 4 C1-XL nodes - 7GB ram, 20 compute units (8 virtual cores). We allocate 4GB ram to ES. Each node as 1-500GB EBS instance for storage. We run 26 shards with 0 replicas when indexing. It's MUCH faster to index with 0 replicas if you can, then up the replica number after indexing, than it is to index with 1 or more replicas. We set refresh_interval to 30s and merge.policy.merge_factor to 30. After indexing, we set them back to 1s and 1 and run optimize. This really helps.
Our documents are about 2k-5k in size and we index about 10k-12k docs/sec initially. After 240m docs, we're in the 5k-6k docs/sec range. We wrote our own multi-threaded indexing tool to do the work. We enable _source and compression on _source. We still have _all enabled though we are not using it. We'll disable that in the next round.

  • Craig

On Mon, Mar 5, 2012 at 9:11 AM, haarts <harmaa...@gmail.com (mailto:harmaa...@gmail.com)> wrote:

Thanks a lot for the insight! I'd better convince by boss to buy 16 disk machines. :slight_smile:

On Monday, 5 March 2012 17:03:38 UTC+1, Thomas Peuss wrote:

Hi!

Am Montag, 5. März 2012 15:21:11 UTC+1 schrieb haarts:

Those are some impressive numbers. Would you mind sharing on what kind of machines you are running? We are struggling indexing 500M documents, reaching 1000+ inserts per second on a 3 node cluster (8 core i7 24GB, 1 simple spinner). Performance indexing is acceptable. But first time query performance isn't great (seconds...).

We are running a 8-node cluster in two datacenters (4 nodes per DC). Each machine has 24 cores, 32GB RAM and 8 disks (extendable to 16 disks) running RHEL 6.1. The machines are not dedicated to ES alone (we use 50% of the cores for number crunching without I/O involved). Currently we are running with 16 shards and 1 replica.

We are currently peaking at 400 docs/s but the numbers are rising... :wink:

You should try to insert with many threads in parallel (we use 16). Important here is that you wait for the response from ES because otherwise you will overload ES.

CU
Thomas

--

CRAIG BROWN
chief architect
youwho, Inc.

www.youwho.com(http://www.youwho.com/)

T: 801.855. 0921
M: 801.913. 0939


(Craig Brown) #14

OK, thanks for the advice.

  • Craig

On Mon, Mar 5, 2012 at 1:31 PM, Shay Banon kimchy@gmail.com wrote:

Note that the merge factor parameter does not apply to the default tiered
merge policy. In any case, setting it to 1 is not recommended, since you
can always control the number of shards it will optimize down to in the
optimize call API.

On Monday, March 5, 2012 at 6:45 PM, Craig Brown wrote:

We're running on AWS, 4 C1-XL nodes - 7GB ram, 20 compute units (8 virtual
cores). We allocate 4GB ram to ES. Each node as 1-500GB EBS instance for
storage. We run 26 shards with 0 replicas when indexing. It's MUCH faster
to index with 0 replicas if you can, then up the replica number after
indexing, than it is to index with 1 or more replicas. We set
refresh_interval to 30s and merge.policy.merge_factor to 30. After
indexing, we set them back to 1s and 1 and run optimize. This really helps.
Our documents are about 2k-5k in size and we index about 10k-12k docs/sec
initially. After 240m docs, we're in the 5k-6k docs/sec range. We wrote our
own multi-threaded indexing tool to do the work. We enable _source and
compression on _source. We still have _all enabled though we are not using
it. We'll disable that in the next round.

  • Craig

On Mon, Mar 5, 2012 at 9:11 AM, haarts harmaarts@gmail.com wrote:

Thanks a lot for the insight! I'd better convince by boss to buy 16 disk
machines. :slight_smile:

On Monday, 5 March 2012 17:03:38 UTC+1, Thomas Peuss wrote:

Hi!

Am Montag, 5. März 2012 15:21:11 UTC+1 schrieb haarts:

Those are some impressive numbers. Would you mind sharing on what kind of
machines you are running? We are struggling indexing 500M documents,
reaching 1000+ inserts per second on a 3 node cluster (8 core i7 24GB, 1
simple spinner). Performance indexing is acceptable. But first time query
performance isn't great (seconds...).

We are running a 8-node cluster in two datacenters (4 nodes per DC). Each
machine has 24 cores, 32GB RAM and 8 disks (extendable to 16 disks) running
RHEL 6.1. The machines are not dedicated to ES alone (we use 50% of the
cores for number crunching without I/O involved). Currently we are running
with 16 shards and 1 replica.

We are currently peaking at 400 docs/s but the numbers are rising... :wink:

You should try to insert with many threads in parallel (we use 16).
Important here is that you wait for the response from ES because otherwise
you will overload ES.

CU
Thomas

--

CRAIG BROWN
chief architect
youwho, Inc.

www.youwho.com http://www.youwho.com/

T: 801.855. 0921
M: 801.913. 0939

--

CRAIG BROWN
chief architect
youwho, Inc.

www.youwho.com http://www.youwho.com/

T: 801.855. 0921
M: 801.913. 0939


(Craig Brown) #15

When you connect to the cluster this way, your client becomes a node in the
cluster and has access to all of the routing and other information. The
client is therefore much more efficient because it can directly communicate
with the node that has to do the work.

  • Craig

On Tue, Mar 6, 2012 at 12:52 AM, Lior Cohen barnea <
liorcohenbarnea@gmail.com> wrote:

Thanks for all the answers.

It seem that when i used a client node

Node node = nodeBuilder().client(true).node(); Client client = node.client()

my indexing time was much faster than when i used transport client to
a local node ...

should it be like that ?

On Mar 5, 10:31 pm, Shay Banon kim...@gmail.com wrote:

Note that the merge factor parameter does not apply to the default
tiered merge policy. In any case, setting it to 1 is not recommended, since
you can always control the number of shards it will optimize down to in the
optimize call API.

On Monday, March 5, 2012 at 6:45 PM, Craig Brown wrote:

We're running on AWS, 4 C1-XL nodes - 7GB ram, 20 compute units (8
virtual cores). We allocate 4GB ram to ES. Each node as 1-500GB EBS
instance for storage. We run 26 shards with 0 replicas when indexing. It's
MUCH faster to index with 0 replicas if you can, then up the replica number
after indexing, than it is to index with 1 or more replicas. We set
refresh_interval to 30s and merge.policy.merge_factor to 30. After
indexing, we set them back to 1s and 1 and run optimize. This really helps.

Our documents are about 2k-5k in size and we index about 10k-12k
docs/sec initially. After 240m docs, we're in the 5k-6k docs/sec range. We
wrote our own multi-threaded indexing tool to do the work. We enable
_source and compression on _source. We still have _all enabled though we
are not using it. We'll disable that in the next round.

  • Craig

On Mon, Mar 5, 2012 at 9:11 AM, haarts <harmaa...@gmail.com (mailto:
harmaa...@gmail.com)> wrote:

Thanks a lot for the insight! I'd better convince by boss to buy 16
disk machines. :slight_smile:

On Monday, 5 March 2012 17:03:38 UTC+1, Thomas Peuss wrote:

Hi!

Am Montag, 5. März 2012 15:21:11 UTC+1 schrieb haarts:

Those are some impressive numbers. Would you mind sharing on
what kind of machines you are running? We are struggling indexing 500M
documents, reaching 1000+ inserts per second on a 3 node cluster (8 core i7
24GB, 1 simple spinner). Performance indexing is acceptable. But first time
query performance isn't great (seconds...).

We are running a 8-node cluster in two datacenters (4 nodes per
DC). Each machine has 24 cores, 32GB RAM and 8 disks (extendable to 16
disks) running RHEL 6.1. The machines are not dedicated to ES alone (we use
50% of the cores for number crunching without I/O involved). Currently we
are running with 16 shards and 1 replica.

We are currently peaking at 400 docs/s but the numbers are
rising... :wink:

You should try to insert with many threads in parallel (we use
16). Important here is that you wait for the response from ES because
otherwise you will overload ES.

CU
Thomas

--

CRAIG BROWN
chief architect
youwho, Inc.

www.youwho.com(http://www.youwho.com/)

T: 801.855. 0921
M: 801.913. 0939

--

CRAIG BROWN
chief architect
youwho, Inc.

www.youwho.com http://www.youwho.com/

T: 801.855. 0921
M: 801.913. 0939


(system) #16