Intermittently slow GET requests

=== Context ===

We have a cluster with five nodes on virtual machines. The machines are
running CentOS on OpenVZ. Using ES beta2. Four of the machines have 4GB of
memory and 3 cores each. The fifth one has 6GB of memory and 4 cores. On
all machines ES is given half of the memory.

The cluster hosts two indexes. Index1 has two types, A and B. A documents
are fairly small but have one field of type string array which can hold up
to 500 elements. B documents are very small. There's a parent/child
mapping, mapping B documents as children of A documents. B documents have
a ttl of 30 days, but no B document is yet that old. For both A and B there's
a dynamic mapping that maps all strings as not analyzed. Index1 has its
refresh interval set to 30s.
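
To make that concrete, the B mapping amounts to roughly the following. This
is a sketch rather than a dump of our real mapping, but the three relevant
pieces (the parent/child relation, the 30-day ttl and the not_analyzed
dynamic template) are there:

# Creates the B type as a child of A, with a 30-day default ttl and a
# dynamic template that maps every string field as not_analyzed.
curl -XPUT 'localhost:9200/index1/B/_mapping' -d '{
  "B": {
    "_parent": { "type": "A" },
    "_ttl": { "enabled": true, "default": "30d" },
    "dynamic_templates": [
      { "strings_not_analyzed": {
          "match": "*",
          "match_mapping_type": "string",
          "mapping": { "type": "string", "index": "not_analyzed" }
      } }
    ]
  }
}'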

Index2 has a single type, C. C documents are fairly large and have a couple
of string fields that are analyzed using the snowball analyzer. Index2 has
its refresh interval set to 2s.

The application deals with 50-100 requests/second. For each request the
following happens:

  • A bulk request updating an A document and indexing a new B document. The
    A document is updated using a fairly heavy MVEL script, with an upsert if
    the document doesn't already exist. The B document is always a child of
    the A document (rough sketches of this call and the mget below follow
    after this list).
  • An mget request is made for one C document and one A document. The A
    document is always the same A document as the one being updated. Only a
    single field from the C document is retrieved.
  • For about half of the requests an additional mget request is made,
    getting 25 more C documents.
  • Also, recently we had an incident where one of the nodes ran out of disk
    space. This caused some updates of A documents to fail with an EOF
    exception as the document couldn't be retrieved. We're dealing with this by
    detecting failed updates of A documents and then making an additional index
    request to overwrite the "corrupt" A document.
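
As referenced above, the bulk call and the first mget look roughly like this.
The ids, the script and the field names are placeholders, not the real ones:

# 1) Bulk: scripted upsert of an A document plus indexing of a child B document
curl -XPOST 'localhost:9200/_bulk' --data-binary '
{"update": {"_index": "index1", "_type": "A", "_id": "a-1"}}
{"script": "ctx._source.tags += tag", "params": {"tag": "x"}, "upsert": {"tags": ["x"]}}
{"index": {"_index": "index1", "_type": "B", "_id": "b-1", "_parent": "a-1"}}
{"some_field": "some value"}
'

# 2) mget: a single field from one C document plus the same A document
curl -XPOST 'localhost:9200/_mget' -d '{
  "docs": [
    {"_index": "index2", "_type": "C", "_id": "c-1", "fields": ["some_field"]},
    {"_index": "index1", "_type": "A", "_id": "a-1"}
  ]
}'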

In addition to the above, about 3 search requests are made each second and
about one C document is indexed per second.

The indexes currently hold about 5M A documents, 50M B documents and 30K C
documents.

All of the nodes report 100% memory usage in the virtualization control
panel as well as in free, but less in top. CPU usage is around 30%. In the
ES slow logs we see indexing times of up to 10s every couple of minutes.
We're also seeing search times of 3-10s every 5 minutes. Both of these are
acceptable from the application's perspective.

Below are our current cluster settings:

"threadpool.index.queue_size" : "300",
"threadpool.index.queue_size2" : "300",
"threadpool.bulk.queue_size" : "300",
"threadpool.get.type" : "cached",
"threadpool.index.type" : "cached",
"threadpool.bulk.type" : "cached"

=== Problem ===

Logging at the application level shows that 10-30% of the mget requests
made per application request take 1 to 8 seconds, which isn't acceptable.
The other 70-90% seem to be really fast. When making GET requests directly
to the nodes the same pattern appears: some requests complete in 2-10ms
while others take one or several seconds. This is true no matter which node
the requests are made to, including when making the request from the node
itself. Specifying preference=_local or preference=_primary doesn't have
any effect.
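
The direct tests are essentially just timed curls along these lines (the id
here is made up):

# Time repeated direct GETs against a single node
for i in $(seq 1 100); do
  curl -s -o /dev/null -w '%{time_total}\n' \
       'localhost:9200/index1/A/some-id?preference=_local'
done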

=== Question ===

I'd like any input on the cause of the slow GET requests. The cluster
obviously has a lot to deal with, but it seems strange that the exact same
request can take only a couple of milliseconds and then take seconds a very
short while later.

I'm particularly interested in:
A) Whether we should increase server resources
B) Whether we should set a lower ttl for B documents, thereby drastically
decreasing the total number of documents in the cluster
C) Any configuration changes we should make
D) Whether the "corrupt" documents for which updates fail could be the
cause of the problem

Thanks in advance for any input!

Joel

It could be GC.

What tools are you using for monitoring your cluster? If none, then check
out elastichq, kopf and bigdesk. These will give you insight into possible
links between HTTP response times and various system and application metrics.
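
If you want raw GC numbers as well, you can turn on the JVM's own GC logging.
Something like the following (the log path is just an example, and where you
set the options depends on how you start ES):

# HotSpot GC logging for the ES JVM
export ES_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/elasticsearch/gc.log"
bin/elasticsearch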

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

We're using Bigdesk. So far we haven't seen any correlation between spikes
in GET time and GC. The spikes in GET time do seem to match spikes in
search time, and even though we've so far assumed both were effects, maybe
the search requests are actually the cause. We can try running the app with
only GETs and no searches for a while.

Anyhow, regarding GC: any advice on how to monitor it? And any advice on
what to do if it indeed is GC?

Thanks, Joel

Hey,

Try the hot_threads API to see what's behind the current high CPU load, and
the stats API to get some numbers over time, especially regarding GC
monitoring.
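
For example (the exact stats URL differs slightly between 0.90 and 1.0):

curl 'localhost:9200/_nodes/hot_threads'            # what the hottest threads are doing right now
curl 'localhost:9200/_nodes/stats?jvm=true&pretty'  # 0.90-style; GC counts and times under jvm.gc
curl 'localhost:9200/_nodes/stats/jvm?pretty'       # 1.0-style URL for the same numbers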

--Alex
