We are running a 10-node Elasticsearch 1.4.2 cluster and getting cluster-wide
throughput of 18,161 docs/sec, or about 18MB/sec. We'd like to improve this as
much as we can without impacting query times too much.
Our hardware:
RAM: 128GB
Disks: 8 disks, 7200 RPM, 1TB in a RAID 0 array
CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz - 16 physical cores, 32 HT
cores
Network: 1 x 10GbE
The nodes are running CentOS 6.5 and Java 1.7.0_67. We're setting the
Elasticsearch heap size to 30GB.
We are testing ingest by inserting 10GB of data with YCSB. Document sizes
are 1KB, with 10 string fields, each 100 bytes. There is 1 YCSB client
with 20 threads, writing to a single index with 5 shards and 0 replicas.
YCSB connects using the Java Node Client.
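For reference, a rough sketch of how the client and an index with those settings can be set up through the Java Node Client API (the cluster and index names here are placeholders rather than our exact configuration):

    // Client-only node that joins the cluster but holds no data
    Node node = nodeBuilder()
            .clusterName("es-benchmark")     // placeholder cluster name
            .client(true)
            .node();
    Client client = node.client();

    // Index with 5 primary shards and no replicas
    client.admin().indices().prepareCreate("usertable")   // placeholder index name
            .setSettings(ImmutableSettings.settingsBuilder()
                    .put("number_of_shards", 5)
                    .put("number_of_replicas", 0)
                    .build())
            .execute().actionGet();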
What is the expected ingest rate in this type of environment? What
parameters are recommended to increase the ingest rate?
Many things will affect the rate of ingest; the biggest one is making sure
the load gets spread around. But are you sure ES is what's bottlenecking
here? With only 5 shards you're only using half your cluster, and I'm
willing to bet your 20 importer threads aren't maxing that out. You also
need to make sure the import process spreads connections across the nodes;
otherwise you may be limited in other ways by the node you're connecting to.
Also make sure the client is using bulk requests, and experiment with the
bulk sizes.
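To make that concrete, bulk indexing with the Java client looks roughly like this. The index/type names, the docs map, and the 1,000-action batch size are placeholders to experiment with, not recommendations:

    BulkRequestBuilder bulk = client.prepareBulk();
    int batchSize = 1000;                                    // tune this
    for (Map.Entry<String, XContentBuilder> e : docs.entrySet()) {   // docs: id -> source, built elsewhere
        bulk.add(client.prepareIndex("usertable", "usertable", e.getKey())
                .setSource(e.getValue()));
        if (bulk.numberOfActions() >= batchSize) {
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
            bulk = client.prepareBulk();                     // start the next batch
        }
    }
    if (bulk.numberOfActions() > 0) {
        bulk.execute().actionGet();                          // flush the remainder
    }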
FYI, I've been testing a new system configuration using an 8-core Avoton
CPU with 6 x SSDs in RAID 0. On this system (a single node), ingest can
sustain around 3,500 docs/sec with documents of a similar size to yours
before it becomes CPU bound. You have much more CPU capacity, so I would
expect your hardware to exceed this by a fair margin, but your current
numbers don't show that.
Just some thoughts. Yeah, with 16 cores per machine and 10 machines, having
only 5 shards per index is probably too low.
What are your system metrics telling you? Are the CPUs idle? What does
the CPU I/O wait look like?
Are you doing single index operations or batch index operations with YCSB?
Another thing to think about: YCSB was built to test the key/value
performance properties of a database. If I remember correctly, the values
put into the strings are randomly generated. Pure random data is about the
worst case possible for cardinality when it comes to full-text indexing
data structures, so you might want to adjust for that when creating the
mapping for your index. If the values are purely random, rather than
randomly pulled from a dictionary of fixed size (English only has 200k or
so words), then the data you are putting in may be penalizing ES for having
indexing features turned on by default.
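For example, a mapping along these lines would keep ES from building a full-text index on fields you never search. This is a sketch only; it assumes YCSB's default field names (field0..field9), a type called usertable, and an index called ycsb:

    XContentBuilder mapping = jsonBuilder().startObject()
            .startObject("usertable")
                .startObject("_all").field("enabled", false).endObject()   // no catch-all _all field
                .startObject("properties");
    for (int i = 0; i < 10; i++) {
        mapping.startObject("field" + i)
                   .field("type", "string")
                   .field("index", "no")        // store the value, but build no inverted index
               .endObject();
    }
    mapping.endObject().endObject().endObject();

    client.admin().indices().preparePutMapping("ycsb")
            .setType("usertable")
            .setSource(mapping)
            .execute().actionGet();

You'd probably want to apply a mapping like this when creating the index, before any data goes in.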
Thank you all for your inputs. I am working with Brian on this exercise as
well. Let me try to answer some of the questions.
CPU usage:
Only 2 CPU cores are in use. When I monitor disk usage, I notice that every
2 minutes or so it goes close to 100% for about 10 seconds, and for the rest
of the time it remains below 10%. We have set ES_HEAP_SIZE to 30GB and the
machines have 128GB of RAM available.
YCSB:
YCSB generates a 23-byte key made of a long integer with a 'user' prefix
(e.g. user9348485929), and fills the 10 fields with random bytes. All of the
keys we insert are unique. YCSB performs a single index operation per
document; the insert method looks like this:

    public int insert(String table, String key, HashMap<String, ByteIterator> values) {
        try {
            final XContentBuilder doc = jsonBuilder().startObject();
            for (Entry<String, String> entry : StringByteIterator.getStringMap(values).entrySet()) {
                doc.field(entry.getKey(), entry.getValue());
            }
            doc.endObject();
            client.prepareIndex(indexKey, table, key)
                  .setSource(doc)
                  .execute()
                  .actionGet();
            return 0;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return 1;
    }

Our goal is to write more data (1-2 TB) to the same index later on; the 10GB
insert is just to get ES tuned for the workload. For this use case, what
would be a recommended number of shards? Is there a data-size-to-shard-count
ratio that we should keep in mind going forward?
Thanks,
Milind
No idea how many shards you need. Try 10, 15, 20, 25 and see how the
numbers coming from YCSB and the system stats change.
Wow, that GitHub repo hasn't been touched in 3 years. Elasticsearch and
the Java client for Elasticsearch have probably changed a bit since then,
so be careful about what you read into the results you get from this test.
I'd spend some time understanding mappings and index settings in
Elasticsearch. If you aren't going to query by the 10 secondary fields and
just want to be able to do key/value retrievals in your YCSB test, turning
off indexing on those fields would probably make your load test more
realistic. Full-text indexing of purely random values isn't a realistic
workload compared to real data. Here's a quick code sample for Sense that
shows an example of turning off a few things to prevent indexing stuff you
don't need for YCSB.
If you really want to know whether ES is a good fit, I'd recommend doing
bulk loads of 10 GB of real data.
Dave Erickson / Elastic / Solutions Architect / Washington, DC
mobile: +1 401 965 4130
I think much of your problem at this stage is YCSB. It doesn't sound like
it's even close to pushing the limits of your cluster.
If you just want synthetic data, makelogs[1] can be used to generate data
that looks a little more real-world, and it uses bulk requests. A single
running instance of makelogs won't max out your cluster, but running 10,
20, or 30 copies simultaneously might. Just be aware that the default data
makelogs creates has sub-documents in it. If your real data will not have
sub-documents, then disabling those fields (--omit geo --omit meta --omit
machine --omit relatedContent) will show much higher throughput.
After a little delay, we've gone through some of the suggestions in here,
and made some changes:
We disabled indexing. For our current purposes, we don't need to
search. With indexing disabled, is there any reason that random strings
would be worse than real strings?
We increased the number of shards to 30.
We increased the heap size to 64GB.
We also switched over to using our own code to push data into Elasticsearch.
We tested with 1, 4, and 8 node clients, with the following results:
All of these clients were running in a single JVM, but we weren't maxed out
on CPU, disk, or network on that system. We also tested with running 4
clients in each of two JVMs (on two different systems), and got throughput
of 60MB/sec.
Should we expect to see better scalability as we increase the number of
clients? Would using transport clients instead of node clients make things
better or worse?
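For what it's worth, a minimal sketch of the ES 1.x TransportClient setup, pointed at several data nodes so requests don't all funnel through one coordinating node. The cluster name and hostnames are placeholders:

    Settings settings = ImmutableSettings.settingsBuilder()
            .put("cluster.name", "es-benchmark")       // placeholder
            .put("client.transport.sniff", true)       // discover the remaining cluster nodes automatically
            .build();
    TransportClient client = new TransportClient(settings)
            .addTransportAddress(new InetSocketTransportAddress("es-node-01", 9300))
            .addTransportAddress(new InetSocketTransportAddress("es-node-02", 9300))
            .addTransportAddress(new InetSocketTransportAddress("es-node-03", 9300));

With sniffing enabled the client learns about the rest of the cluster on its own; otherwise it round-robins across the addresses you list explicitly.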
I am fairly new to Elasticsearch and could definitely use some help in solving issues with our Elasticsearch deployment.
We are experiencing issues with our Elasticsearch cluster. We have a 30-node cluster with the following configuration:
48 cores
128 GB
SSDs with 35,000 IOPS and 138 MB/sec read and write speeds.
Currently we are getting 2,000 docs/sec on the cluster. We are looking to improve performance at least fivefold, as the system metrics (CPU load, heap, etc.) are very low at the moment, indicating there is potential for optimization.