High CPU usage on large EC2 nodes

Hi,

I'm pretty new to Elasticsearch, though I have used Lucene extensively.
We are currently migrating from Lucene to Elasticsearch in our project.

We have created a basic Elasticsearch setup on AWS and are trying to test
its performance.

The configuration:
EC2 Nodes - 2 Large nodes
Shards - 5
Replication - 1
Memory settings - 4GB

We have created a basic index of about 7GB. For the performance tests, we
have kept the index essentially constant, i.e., the index is not being
updated. No indexing requests are sent to the Elasticsearch server.

Now we are bombarding *each* Elasticsearch node with about 100 search
requests per second (using a single JMeter client for this). Each search
query is a boolean query with 5-6 term query clauses.
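For concreteness, a query of that shape might look like this in the Elasticsearch query DSL (the field names and values here are invented for illustration, not taken from the actual test):

```python
import json

# A boolean query with five term clauses, roughly the shape described
# above; every field/value below is made up.
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"status": "active"}},
                {"term": {"category": "books"}},
                {"term": {"lang": "en"}},
                {"term": {"region": "us"}},
                {"term": {"format": "paperback"}},
            ]
        }
    }
}

# This JSON string would be POSTed as the body of a _search request.
body = json.dumps(query)
```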

Under this load, CPU utilization goes up to 75%. The performance of each
query is still good: each query took about *90ms* to return its result.

We then reduced the shards to 3 and ran the same tests.
The CPU usage remained the same but the performance degraded. Now each
request took about 180ms to return the result.

We expected the results to improve since we reduced the number of shards,
but the opposite happened. Is this the expected result?
And is the high CPU usage also expected?

Thanks
Rohit

--

In general, yes, decreasing the number of shards should improve search
performance (fewer Lucene indices to search against), but I suspect that in
your benchmarking scenario there are many variables and it's hard to keep
them consistent:

  • The m1.large instance type is quite small; in a sense it has a lot of
    "neighbours" -- you never know who is doing what in the same rack
  • The m2.xlarge is better in this sense, and also allows you to use the
    high-I/O EBS volumes
  • A lot depends on the disk used for ES -- are you using the EBS-backed
    instance disk? The "physical" ephemeral disk of the instance? An extra
    EBS volume, possibly with provisioned IOPS?
  • Regarding the CPU, I'd say it's expected that you'll saturate the
    machine's resources at some point, and ~100 req/sec sounds kinda OK to me
    for the type of machine in question. You can use the hot_threads API to
    check where the time is spent
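The hot_threads check above is a plain HTTP GET; a minimal sketch, assuming a node listening on the default port 9200 (the endpoint path matches the docs of that era, so verify it against your ES version):

```python
import urllib.request


def hot_threads_url(host="localhost", port=9200, threads=3):
    # `threads` caps how many hot threads are reported per node
    return f"http://{host}:{port}/_nodes/hot_threads?threads={threads}"


# Against a live node you would fetch and print the plain-text report:
#   print(urllib.request.urlopen(hot_threads_url()).read().decode())
url = hot_threads_url()
```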

Karel

On Monday, January 28, 2013 6:52:19 PM UTC+1, rohit reddy wrote:

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

We are using ephemeral disks with S3 backups, since we expect the
performance of ephemeral disks to be better than EBS. And since our index
does not get updated very frequently, the overhead of storing backups in S3
is not huge.

I'll use the hot_threads API and try to identify which resource is taking
up the CPU.

Thanks
Rohit

--

Attached is the hot_threads snapshot from the Elasticsearch API.
I'm using DFS_QUERY_THEN_FETCH for the search.

Thread stack - elasticsearch · GitHub

It seems like most of the threads are waiting on reads from the Lucene
index. Is this normal, or should I tweak some configuration to reduce it?
We're using all defaults for now.
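The search_type in question is passed as a query-string parameter on the search endpoint; a small sketch of the two URLs being compared (host and index name are invented):

```python
def search_url(index, search_type="query_then_fetch",
               host="localhost", port=9200):
    # search_type selects the distributed execution mode, e.g. the
    # default query_then_fetch vs. dfs_query_then_fetch
    return f"http://{host}:{port}/{index}/_search?search_type={search_type}"


dfs = search_url("myindex", "dfs_query_then_fetch")
default = search_url("myindex")
```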

--

It seems like it's waiting most of the time on reads. Which AWS instance type are you using? Make sure to have ~50% of the memory allocated to ES (ES_HEAP_SIZE), and the other half left to the OS.
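The ~50% rule above, expressed as a tiny helper (the function is purely illustrative, not an ES API; the m1.large RAM figure is AWS's published spec):

```python
def heap_size_mb(total_ram_mb: int) -> int:
    # Give about half the machine's RAM to the ES heap and leave the
    # rest to the OS filesystem cache.
    return total_ram_mb // 2


# An m1.large has ~7.5 GB of RAM, so roughly ES_HEAP_SIZE=3840m
m1_large_heap = heap_size_mb(7680)
```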

Also, which Java version are you using? Make sure you are on the latest 1.6 (update 34 and above) or 1.7. This makes a big difference, and on older Linux distros the default Java provided is pretty old (4 years old).

I would shy away from the DFS_ search types; typically you don't really need them with a big enough data set.

Last, the reason why more shards performed better is that, even on 2 nodes, each search request was being parallelized across more shards (each holding less data). Note: if you start running concurrent client tests, make sure to configure the search thread pool with a fixed size of about 4 times the number of CPUs you have, so it won't be overwhelmed by concurrent executions.
