Need advice on production setup; experiencing issues with CPU usage

We are using 0.18.6 in our production EC2 environment. We have three
Java web application servers, two search servers, and two database
servers, all EC2 large instances. We originally had the database and
search servers together, but we were seeing high CPU usage and
couldn't determine the cause, so we split them up. What we are seeing
now is that each day during our peak hours (8am - 6pm) the search
instances run their CPUs at 60-80%. Today we had an issue where we had
to restart the search servers: the web servers were dropping
connections to the search servers, causing timeouts for users, and the
search servers were at 80-85%.

Here are the details:

- We average about 1,500 users logged in at a time during our peak hours.
- We have two search instances running 0.18.6 and three search clients
(one on each web server).
- Each search instance stores its data on a 200GB EBS volume.
- Our application is multi-tenant; we currently have about 200 tenants.
- Each tenant gets its own index with the default settings (5 shards
per index, etc.).
- Our application logs almost every request a user makes (to both the
database and search).
- We currently have around 10,000,000 documents across all of the indexes.
- We are currently at 45,000 open file descriptors.
- Our CPU usage is between 60-80% daily on both search servers (I can
send screenshots of the EC2 charts).

My thoughts:

Maybe logging each request that a user makes through our site to
Elasticsearch is causing this?
I don't think that creating a new index for every customer is the way
to go here, but I'm not sure what the best way is. We are going to hit
our 64k file descriptor limit very soon.

I'm looking for ways to improve our setup. Thoughts?

It looks like you have about 1,000 shards. That adds a lot of overhead
to the servers; it's way too many for the amount of data you have. It is
possible to handle multi-tenancy in other ways, without creating a
separate index per tenant. At the very least, you don't need 5 shards
per index. You can start by setting the number of shards to 1 per index.
This should reduce open files, memory, and CPU consumption.
Unfortunately, it will require reindexing.
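To make the suggestion concrete, here is a minimal sketch of the settings body for creating a one-shard index; the index name and replica count are illustrative, not from the thread:

```python
import json

# Sketch: settings body for creating a tenant index with a single shard
# instead of the default five. The index name (tenant_acme) is made up.
def one_shard_settings(replicas=1):
    return {
        "settings": {
            "index": {
                "number_of_shards": 1,
                "number_of_replicas": replicas,
            }
        }
    }

# A PUT to /tenant_acme with this body would create the index.
print(json.dumps(one_shard_settings(), indent=2))
```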

The more resource-efficient way of handling multi-tenancy would be to
store the tenant as a field of the doc and add it to each query to limit
the results to that tenant's documents. ES provides "routing" so that
queries can be directed to the relevant shard only (search the mailing
list and ES docs for routing). For the servers you have, you can have
one index with 5-10 shards and be done with it.
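A rough sketch of what such a query could look like: the routing parameter steers the request to one shard, and a term filter on the tenant field scopes the results. The index name, field name, and `filtered` query shape here are illustrative assumptions, not from the thread:

```python
import json

# Sketch: all tenants share one index; each search carries a routing
# parameter (hits only the relevant shard) plus a term filter on the
# tenant field (matches only that tenant's docs). Names are made up.
def tenant_search(tenant, user_query):
    path = "/tenants/_search?routing=" + tenant
    body = {
        "query": {
            "filtered": {  # assumed query shape for an ES of this era
                "query": {"query_string": {"query": user_query}},
                "filter": {"term": {"tenant": tenant}},
            }
        }
    }
    return path, body

path, body = tenant_search("acme", "timeout error")
print(path)
print(json.dumps(body))
```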

You didn't say how many search requests ES serves per minute, whether
you use facets, etc., but I'd say it's not surprising that the ES boxes
are getting overwhelmed.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

On Thu, Jan 19, 2012 at 1:11 PM, danpolites dpolites@gmail.com wrote:


+1 Berkay. On two (large) boxes, that's a lot of shards. The best
solution is a single index with overallocation of shards (let's say,
20-30), but use routing (username, or something similar) when
searching/indexing so you only hit one shard for that index. Note, you
will also need to filter the search to run against only that username
as well (so it needs to be stored in the doc).
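The indexing side of that advice can be sketched as follows: the doc stores the username field (so the search filter has something to match), and the index request carries the same value as its routing parameter. All names here are illustrative:

```python
import json

# Sketch: index-time routing. The doc carries the username field, and
# the request path carries the same value as the routing parameter, so
# the doc lands on one predictable shard. Index/type names are made up.
def index_request(index, doc_type, doc_id, username, doc):
    doc = dict(doc, username=username)  # filter field must live in the doc
    path = "/%s/%s/%s?routing=%s" % (index, doc_type, doc_id, username)
    return path, doc

path, doc = index_request("tenants", "event", "1", "acme", {"msg": "login"})
print(path)
print(json.dumps(doc))
```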

Aliases can help here. You can create an alias for each new user; the
alias will be the username, with a routing value and a filter (based on
the username) associated with it. Then, when you index against that
alias, it will automatically apply the routing, and when you search, it
will automatically apply the routing and the filter. More info in the
Elasticsearch index aliases documentation.
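A hedged sketch of the per-user alias described above, combining a routing value and a username filter in one `_aliases` action; the index/alias/field names are made up, and the exact body syntax should be checked against the aliases docs for your ES version:

```python
import json

# Sketch: one alias per user, carrying both a routing value and a term
# filter, so indexing and searching through the alias apply both
# automatically. Names are illustrative.
def user_alias_action(index, username):
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": username,
                    "routing": username,
                    "filter": {"term": {"username": username}},
                }
            }
        ]
    }

# A POST to /_aliases with this body would register the alias.
print(json.dumps(user_alias_action("tenants", "acme")))
```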


Thanks guys, that's definitely the issue. Until we get some time to
change our code, we brought up two more instances to help offload some
of the work. Just doing that took our open files down to 20k. The CPU
usage is under 10 percent at the moment, but we are not into our day
yet. Hopefully the numbers stay down. Thanks again!