Understanding scaling & clustering decision making

Hello,

We are in the midst of transitioning much of our data into ES, but have
been running into performance issues loading data into our index.

As recommended in other threads on this group, I've created a separate
testbed for understanding the interaction of ES with our data.

Summary question:

  • What are the main resource constraints and flags that indicate hardware
    and cluster scaling needs? Is it a max number of documents of a certain
    type X per shard, and the corresponding disk and memory space they take
    while running? Our experience below makes it difficult (for me, currently)
    to assess a clear decision-making path based on the behavior I am
    observing.

Long story:

We have 2 major document types so we've created 2 indices, one for each
type. For our testing to assess the "limits" of our data and ES, we set
them to be single shard indices.

This is all on a single node: a Rackspace Cloud 8GB machine assigned 4 CPUs.
We've set the Java heap to 75% of available RAM, so that's 6GB.

We're running bulk indexing with the following settings for the single
shard indices:

{
"index": {
"index.number_of_replicas": "0",
"merge.policy.merge_factor": 30,
"refresh_interval": "-1",
"store.throttle.max_bytes_per_sec": "5mb",
"store.throttle.type": "merge"
}
}

We are simultaneously running bulk indexing operations against the two
document types and their respective indices.
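
For reference, each bulk POST is just the standard _bulk newline-delimited
format, roughly like this (index/type/field names are simplified
placeholders, and only 2 of the 500 docs are shown):

# bulk.json - one action line followed by one document line per doc
{ "index" : { "_index" : "form_index", "_type" : "form", "_id" : "1" } }
{ "received_on" : "2013-02-20T12:00:00", "app_id" : "abc123" }
{ "index" : { "_index" : "form_index", "_type" : "form", "_id" : "2" } }
{ "received_on" : "2013-02-21T09:30:00", "app_id" : "def456" }

curl -s -XPOST 'localhost:9200/_bulk' --data-binary @bulk.json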

On one index, FORM_INDEX, it's humming along fine: bulk index operations of
500 documents have never taken more than 2-3 seconds per POST, and they are
usually in the 100-500ms range.

As of this writing, the index (and lone shard) size is 1.2gb with 300,000
documents - no change in performance. So far so good!

On the other index, CASE_INDEX, it was humming along fine: bulk index
operations of 500 documents were in the sub-1000ms range. However, once it
hit 200,000 documents, the POSTs started inching up in time - 10 seconds,
30, 60, 120 seconds.

It was interminably slow for about an hour during this time, but then it
inexplicably sped up to < 1 second bulk insert times. It experienced
another blowup in insertion times, but now it is back to sub 1500ms
insertion times.

The CASE_INDEX stats right now: 7gb index (and lone shard) size, 266,000
documents.

ES is reporting its heap usage at 943mb (per bigdesk).

So initially I'm thinking it's understandable that CASE_INDEX would have
some slowdown compared to FORM_INDEX due to the complexity of the mappings
we have defined and the indexing we do on it - but I would have expected a
more graceful decline in performance signalling "you are reaching the end of
the line on this shard". Instead it's rather bursty.

My main question: I am seeking a decision-making heuristic of some sort for
understanding when to add resources to our cluster versus when to play with
our shard count and JVM settings. Is my desired outcome some magic-number
limit on num_docs and size for CASE_INDEX and XFORM_INDEX respectively, for
a given machine RAM size?

I ask this because I ran into a similar precipitous decline in bulk-indexed
docs per second on our other testbed with 5-shard indices, which also
magically recovered with no intervention on our end. That is the same 8GB
hardware running CASE_INDEX and FORM_INDEX at 5 shards each, with 500k and 1
million docs respectively. With 4gb of the 6gb heap in use, a similar bulk
indexing operation ran into the same fast/slow bulk insertion rates.

I'm ill-equipped to explain and justify throwing more (virtual) hardware at
our environment when our single-shard setup is exhibiting the same behavior
as our presumably overburdened machine.

Thanks,

Dan


Hi,

Thanks for giving so many details. I think your CASE_INDEX scenario, at 7GB,
was simply hit by Lucene merging. Do you monitor CPU/memory/load on your
nodes?
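
Even something as simple as watching iostat during one of the slow phases
will show whether the node is I/O bound, e.g.:

iostat -x 5
# watch %util and await on the data disk - if %util sits near 100% and
# await climbs during the slow bulk periods, the node is I/O bound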

Note the index is bigger than the heap memory, so ES must have started up
Lucene segment merging, which is very heavy on I/O reads and writes. This
merging is vital, but it can bring a system near to a halt for a period of
time, maybe seconds, maybe minutes. Because of the sudden heavy I/O wait in
this step, you don't see a soft decline in performance - it's all or
nothing. And there is no "magical recovery" - it is just the end of a large
segment merge. These large merges are rare; they may happen every 10 or 20
minutes, or even hours apart, during constant bulk indexing runs, depending
on your machine setup.

First, your disk input/output subsystem seems too slow for your
requirements; think about faster disk writes. This is the slowest part of
the whole system, and you will notice that improving it helps quickly
without changing any code.

Second, it also appears that a merge.policy.merge_factor of 30 may be a bit
too high for your system to handle segment merges gracefully; maybe you can
think about reducing the number to 20, or even 10.
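
If the setting is dynamic on your version, something like this should do it
(the index name is just an example); otherwise set it at index creation:

curl -XPUT 'localhost:9200/case_index/_settings' -d '
{
    "index.merge.policy.merge_factor": 10
}
'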

You can't estimate bulk cost only by counting documents or looking at the
mapping. Yes, a few analyzers are known for bad performance, but most are
not, so the mapping is in most cases not critical for overall performance
unless you're doing really wild things. A better measure for estimating the
cost of indexing a document is the size of the existing index, the size of
the document, how frequently terms occur in the index and in the document,
and the maximum number of different/unique words/terms per field. You
should think in "streams" and byte volumes per second that you put on the
machine.

For improving bulk indexing "smoothness", you can study
org.elasticsearch.action.bulk.BulkProcessor. It demonstrates how to set an
upper limit on concurrency: if the bulk processor exceeds a certain number
of concurrent requests, it simply waits on the client side. This helps
prevent the ES cluster from being overloaded, and the bulk ingest behaves
more predictably.
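
If you are not on the Java client, you can approximate the same idea
yourself: cap the number of in-flight bulk requests and wait when the cap is
reached. A rough shell sketch (file layout and numbers are made up):

MAX_IN_FLIGHT=2
for f in bulk_batches/*.json; do        # each file holds one pre-built _bulk body
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_IN_FLIGHT" ]; do
        sleep 0.2                       # wait for a slot before sending more
    done
    curl -s -XPOST 'localhost:9200/_bulk' --data-binary "@$f" > /dev/null &
done
wait                                    # let the last requests finish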

Jörg

On 26.02.13 04:28, Daniel Myung wrote:

It was interminably slow for about an hour during this time, but then
it inexplicably sped up to < 1 second bulk insert times. It
experienced another blowup in insertion times, but now it is back to
sub 1500ms insertion times.


To add to what Jörg said

First, your disk input/output subsystem seems too slow for your
requirements; think about faster disk writes. This is the slowest part of
the whole system, and you will notice that improving it helps quickly
without changing any code.

Merges happen all the time in the background, and usually they are
pretty light (3 small segments get merged into a bigger, but still small
segment).

Unfortunately, when you have 3 big segments getting merged into, e.g., a 5GB
segment, that can use a lot of I/O, which impacts the performance of the
system.

If you can afford SSDs, definitely go for it. Otherwise, you will want
to throttle the speed of merges -- currently they happen as fast as your
system allows.

Take a look at "store level throttling" on
http://www.elasticsearch.org/guide/reference/index-modules/store.html
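
For example, a node-wide merge throttle looks something like this (if your
version lets you set it dynamically via the cluster settings API; otherwise
the same two settings can go into elasticsearch.yml). Treat the number as a
starting point to experiment with:

curl -XPUT 'localhost:9200/_cluster/settings' -d '
{
    "persistent": {
        "indices.store.throttle.type": "merge",
        "indices.store.throttle.max_bytes_per_sec": "5mb"
    }
}
'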

Also, only leaving 25% of your RAM for the filesystem caches is
insufficient. We recommend 50% of RAM for the ES_HEAP and 50% for the
system. The file system caches help a lot with disk access speed
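
On your 8GB box that would be something along these lines (assuming the
stock startup script, which reads ES_HEAP_SIZE):

export ES_HEAP_SIZE=4g      # ~50% of RAM for the JVM heap
bin/elasticsearch           # restart the node so the new heap takes effect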

clint


Thanks for all the information. I figured it was the merge operation at
work. In my other tests, and while trying out Sematext's SPM, CPU and total
system memory usage were spiking like crazy during the slow times.

What was curious was that during my 1-shard test, total OS memory usage was
low - hence my confusion.

I initially operated with the 50% RAM setup when I ran into the initial
merge operation, so I bumped it to 75%. I'll flip it back to 50% to see if
there are changes in its behavior.

You can't estimate bulk cost only by counting documents or looking at the
mapping. Yes, a few analyzers are known for bad performance, but most are
not, so the mapping is in most cases not critical for overall performance
unless you're doing really wild things. A better measure for estimating the
cost of indexing a document is the size of the existing index, the size of
the document, how frequently terms occur in the index and in the document,
and the maximum number of different/unique words/terms per field. You
should think in "streams" and byte volumes per second that you put on the
machine.

So, if I read this correctly: for a given index size and document, the cost
is better measured by the total I/O throughput that indexing the document
incurs on the index, depending on what content is being indexed. In my
situation, once my index exceeded my heap space, everything went to slow
disk land, and that's where things got challenging with merges and the
overall I/O demands of each indexing operation.

So in my index _settings I set a 5mb store-level throttle, and yet things
were still taxing. Is that the indicator that my disks are terrible - that
even at 5mb the merges still crushed my I/O?

Thanks again!
Dan


Does your 5 shard testbed also have the same JVM memory settings (6GB out
of 8GB)? The same merge factor? A higher merge factor is great when you
have the memory. Once you move out of indexing testing and start searching
against the index (while indexing), the memory pressures will become even
more evident.

--
Ivan


Yes, same 75% RAM of 8GB, and same merge factor.


Did you check in the logs if the setting was successfully applied? I
only know this syntax (at index creation)

curl -XPUT 'localhost:9200/myindex' -d '
{
"index.store.throttle.type": "merge",
"index.store.throttle.max_bytes_per_sec": "5mb"
}
'

Jörg

On 26.02.13 18:44, dmyung wrote:

So in my index _settings I set a 5mb store-level throttle, and yet things
were still taxing. Is that the indicator that my disks are terrible - that
even at 5mb the merges still crushed my I/O?


On Tue, 2013-02-26 at 19:29 +0100, Jörg Prante wrote:

Did you check in the logs if the setting was successfully applied? I
only know this syntax (at index creation)

curl -XPUT 'localhost:9200/myindex' -d '
{
"index.store.throttle.type": "merge",
"index.store.throttle.max_bytes_per_sec": "5mb"
}
'

Also note that 5mb may still be too much - experiment with it
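
e.g. something like this, assuming the setting is dynamic on your version
(index name is yours):

curl -XPUT 'localhost:9200/case_index/_settings' -d '
{
    "index.store.throttle.max_bytes_per_sec": "2mb"
}
'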

clint


Yes, curling it shows the settings are present - will experiment with a
more restrictive throttle - thanks!
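
For the record, the check was just a plain GET of the index settings (index
name simplified here):

curl 'localhost:9200/case_index/_settings?pretty'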


So, to revisit this thread (having learned a lot from other issues on the
group since), I thought I'd give a report on our next steps, as our
admittedly unusual ES use case may be of some intellectual interest.

Revisiting the problem:
Our CASE index, on a 2x8GB cluster, trying to index 500,000 documents: about
halfway through bulk indexing we'd start getting exceedingly high indexing
times due to what ultimately turned out to be merge operations. What were
once 200-300ms index times turned into 10-15 minute waits for 500-doc
bursts. Indexing runs became multi-day ordeals of tuning our timeout and
retry thresholds, as well as cluster fault detection parameters, so that an
overwhelmed node wouldn't fall out of the cluster. Even _status and other
diagnostic API calls on our clusters were taking 10-30 seconds to return.

The actual index and our data

  • The CASE index is on a 5 shard setup.
  • 525,000 documents
  • 26.5 GB, default analyzer
  • We dump our JSON data (from CouchDB) unmodified into ES.
  • When querying we actually use _source to pull the data back out and
    serve it up (so we'd prefer not to fiddle with/modify the doc on the way
    to ES).

In this version of the index, we had a total of 788 unique types inside
that index.
In my wisdom (at the time), I thought I needed to separate the types out by
customer and project so as to keep all properties pure, on the off chance
that one customer might make a given field a datetime while another customer
made it a string. Ultimately, I didn't even make that distinction, and just
set { dynamic: true, date_detection: false }.
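
Roughly, each per-customer type just got a stub mapping like this (index and
type names here are made-up examples):

curl -XPUT 'localhost:9200/case_index/customerx_case/_mapping' -d '
{
    "customerx_case": {
        "dynamic": true,
        "date_detection": false
    }
}
'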

Anyway, some numbers:

  • 788 types
  • 41 average properties per type - 24 of which are common across all
    docs (5 of them being datetimes).
  • std dev of properties per type was 28.
  • Total unique property names: 5500
  • (no numbers on the distribution of docs with tons of properties vs.
    docs with few properties - I can try to get them if people are curious)
  • For our needs we changed the tokenizer to whitespace - it cut the total
    index size in half, to 13 GB, but indexing was still crushing our system
    (see the sketch after this list)
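
One way to make that whitespace change is to override the default analyzer
at index creation - a sketch with a placeholder index name, not necessarily
exactly how we did it:

curl -XPUT 'localhost:9200/case_index' -d '
{
    "settings": {
        "index.analysis.analyzer.default.type": "whitespace"
    }
}
'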

One of the indexed properties was an array of transactions, in which I
mapped 3 datetime properties.
All the rest of the fields were string types.

A call to the index's _mapping returned us a 20MB mapping file :)

We kept on crushing our initial single 8GB node when trying to index, and
adding a 2nd node to the cluster only prolonged the inevitable crush by,
say, another 50-100,000 documents.

The Solution: Scrap and start over
We ultimately decided to scrap the fully dynamic mapping on ALL these
properties, both because of the obvious scale issue and after re-examining
how we actually want to use the index. The 90% rule applied for us here: 90%
of the uses of our index (in our current service deployment) only require
queries against the 24 globally shared properties mentioned above. Also,
since we were being rather blunt with our mapping and setting ALL properties
to strings, it made more sense to fully map out properties on a per-customer
basis, based on need. That will eventually let us craft a better mapping (at
least until we can automate getting all the properties correctly typed ahead
of time).

So we made 1 type for the CASE index, CASE_TYPE, with just those 24 shared
properties and { dynamic: false }. The index size is 1.5GB, and it went
lightning fast - the limiting factor was our speed in getting the docs out
of the database. I no longer envy the seemingly otherworldly indexing speeds
I read about on this group!
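
Schematically, the new mapping is just the shared properties with dynamic
mapping switched off - something like this (property names here are
placeholders, not our real ones):

curl -XPUT 'localhost:9200/case_index/case_type/_mapping' -d '
{
    "case_type": {
        "dynamic": false,
        "date_detection": false,
        "properties": {
            "case_id":   { "type": "string", "index": "not_analyzed" },
            "opened_on": { "type": "date" },
            "status":    { "type": "string" }
        }
    }
}
'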

We made a 2nd index called FULL_CASES for customers that have fully defined
property mappings set up for us; there we do set { dynamic: true }, along
with pre-injecting the mapping with all the types ahead of indexing the
first doc. As this customer count is still small, the FULL_CASES index is
small and manageable... for now.

The Future
The question does come up though, if we do ramp up and need to completely
map out all properties for more customers, how can we prepare for this? If
we reach the numbers above again, we'll crush our system. To reiterate, we
think we need to keep unique customer CASE_TYPES in unique types. Customer
X might say last_encounter is a string "at the bowling alley", but Customer
Y might say last_encounter is a datetime. We were told that this was
slightly outside the norm of ES usage - if there are suggested workarounds
or things for us to tune differently we certainly are all ears.

Thanks,

Dan


A call to the index's _mapping returned us a 20MB mapping file :)

WOW!

So we made 1 type for the CASE index, CASE_TYPE, with just those 24 shared
properties and { dynamic: false }. The index size is 1.5GB, and it went
lightning fast - the limiting factor was our speed in getting the docs out
of the database. I no longer envy the seemingly otherworldly indexing speeds
I read about on this group!

heh :)

The Future

The question does come up though, if we do ramp up and need to
completely map out all properties for more customers, how can we
prepare for this? If we reach the numbers above again, we'll crush our
system. To reiterate, we think we need to keep unique customer
CASE_TYPES in unique types. Customer X might say last_encounter is a
string "at the bowling alley", but Customer Y might say last_encounter
is a datetime. We were told that this was slightly outside the norm of
ES usage - if there are suggested workarounds or things for us to tune
differently we certainly are all ears.

Yeah, your two choices are:

  1. different indexes
  2. name the fields differently
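
e.g. for option 2, the conflicting field simply gets a different name per
customer so the mappings never clash (field names are illustrative):

{
    "customer_x_case": {
        "properties": {
            "last_encounter_text": { "type": "string" }
        }
    },
    "customer_y_case": {
        "properties": {
            "last_encounter_date": { "type": "date" }
        }
    }
}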

clint
