High CPU use and intermittent CouchDB river outages

Hi guys,

We are running Elasticsearch version 0.20.5 on two nodes and currently
store around 95 million documents in it. We get a lot of new documents
each day, so we trim the dataset by deleting outdated data; each day the
cleanup deletes around 2 million old documents. I noticed that the
deleted_docs property (in the docs object of the index status) keeps
growing while we are deleting. We have had a couple of incidents lately
during which our river that copies data from CouchDB into Elasticsearch
stopped receiving new data. When we SSH into the servers and check, the
ES service shows very high CPU use (up to 900%). I restarted the service
on both nodes and the system recovered: the cluster went from yellow back
to green and CPU use dropped back to normal. I also noticed that
deleted_docs was cut in half after I restarted the service.

In a separate topic I asked about running _optimize, and others explained
that running it manually is not really needed; we let ES do merges based
on its internal logic. Is it possible that when ES decides it is time to
optimize/compact, it causes a big load that produces the effects I
described? If that is the case, is there a way for us to control how this
process runs, to reduce the impact it has on the system (if it really is
the cause)? We need to stay online at all times, so we are trying to
determine whether the way we remove outdated data is killing our ES nodes
and whether there is a better way to do it.
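
For context, the daily trim is essentially a delete-by-query along these
lines (a simplified sketch; the index name and timestamp field are
placeholders, not our actual mapping):

# Remove everything older than 40 days (GNU date shown here):
CUTOFF=$(date -u -d '40 days ago' +%Y-%m-%dT%H:%M:%S)
curl -XDELETE "http://localhost:9200/docs/_query" -d "{
  \"range\": { \"timestamp\": { \"lt\": \"$CUTOFF\" } }
}"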

Thanks a lot for your time!
Milan Gornik

As you receive data on a daily basis, how about using index aliases over
indices that each contain a single day of data? That way you will not
need to delete documents, which is very expensive. With an index alias
you can simply re-assign the alias to the date period you actually want,
and you can safely and quickly drop obsolete indices.
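
A minimal sketch of what re-pointing an alias and dropping an old index
look like with curl (the index and alias names are just examples):

curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions": [
    { "remove": { "index": "docs-2013-01-18", "alias": "docs" } },
    { "add":    { "index": "docs-2013-02-27", "alias": "docs" } }
  ]
}'

# Deleting a whole obsolete index is a single, cheap operation:
curl -XDELETE 'http://localhost:9200/docs-2013-01-18'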

Jörg

Hi Jörg,

I'm not sure if that would work for our use case (I didn't provide enough
details in the first post). We have a working set of the last 40 days;
that is the current data we are using. And we are getting new records all
the time (in the first post I probably oversimplified that): the system
is used 24/7 and new records come in constantly. We have a background
script that runs on the same schedule each day; it selects everything
older than 40 days and removes it. So I'm not sure how we would chop off
these expired documents using indices and aliases. I guess we would need
a separate index for each day and one alias that joins all the current
indices (every day except those older than 40 days). Then, in the removal
script, we would drop the older indices and recreate the alias to include
only the current ones. Is this possible?

Thanks!
Milan

On our production servers (live 24/7 like yours) we call _optimize with
max_num_segments set to 3 once an hour, after updating/deleting
out-of-date documents. See the Elasticsearch guide to optimize:
http://www.elasticsearch.org/guide/reference/api/admin-indices-optimize.html

We have noticed that this API actually needs to be called twice in
succession. You can tell whether it really succeeded (it always says it
did) by monitoring the index in the data directory: the number of files
will decrease proportionately.

We also set index.merge.policy.segments_per_tier and
index.merge.policy.max_merge_at_once to 3 (the default is 10).

We do this because we see performance advantages for querying during the
remainder of the hour, but it should also provide the benefits you seek,
namely reducing any large merges later in the day.

You can use the curl command forms to experiment.
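
In curl form, roughly (the index name is a placeholder, and you should
verify that the merge policy settings are dynamically updatable on your
version before relying on this):

# Merge down to at most 3 segments per shard:
curl -XPOST 'http://localhost:9200/docs/_optimize?max_num_segments=3'

# Lower the merge policy knobs on a live index:
curl -XPUT 'http://localhost:9200/docs/_settings' -d '{
  "index.merge.policy.segments_per_tier": 3,
  "index.merge.policy.max_merge_at_once": 3
}'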

Randy

Hi Randy,

Thanks a lot for sharing your experience on this topic. One question:
have you tried modifying these parameters on a live system? I would like
to know whether changing them can cause issues for a cluster that is
currently running. I also read about the store level throttling settings
in the Elasticsearch documentation, so the two combined should make it
possible to fully control the merge process in ES.
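
For reference, I understand those throttling settings to look roughly
like this (an untested sketch on my side; the values are arbitrary, and
if they are not dynamically updatable on a given version they would go
into elasticsearch.yml instead):

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.store.throttle.type": "merge",
    "indices.store.throttle.max_bytes_per_sec": "20mb"
  }
}'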

Regards,
Milan

Hi Milan,

I think I'd probably still support Jörg's multiple-index approach.

I'd have an index per day, with a name like docs-20130228. I'd then have an alias 'docs' that pointed to the 'current' day.

I'd modify my application so that it always wrote to an index called 'docs' (the alias will route this to the correct underlying index for the day) - I guess this is actually your CouchDB river, in this case.

I'd also have a script which ran every hour, which looked at the current date/time. If we're about to move to a new day, it would create a new index (with appropriate mappings etc.) then update the 'docs' alias to point to it. The application would then transparently start writing documents to that new index.

At that point, I can figure out what the name of the index for 41-days-ago was, and delete it.
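
Once the script decides a new day has started, the rollover itself might
look something like the sketch below (purely illustrative: the index
names, shard counts and the GNU date invocation are assumptions, and
you'd want proper error handling):

TODAY=$(date -u +%Y%m%d)
YESTERDAY=$(date -u -d '1 day ago' +%Y%m%d)
EXPIRED=$(date -u -d '41 days ago' +%Y%m%d)

# 1. Create today's index with whatever settings/mappings you need.
curl -XPUT "http://localhost:9200/docs-$TODAY" -d '{
  "settings": { "number_of_shards": 2, "number_of_replicas": 1 }
}'

# 2. Atomically move the write alias from yesterday's index to today's.
curl -XPOST 'http://localhost:9200/_aliases' -d "{
  \"actions\": [
    { \"remove\": { \"index\": \"docs-$YESTERDAY\", \"alias\": \"docs\" } },
    { \"add\":    { \"index\": \"docs-$TODAY\", \"alias\": \"docs\" } }
  ]
}"

# 3. Drop the index that just fell out of the 40-day window.
curl -XDELETE "http://localhost:9200/docs-$EXPIRED"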

I guess something like that is what Jörg was getting at.

Remember that searches can easily span multiple indices, so you don't limit your flexibility to search across your whole 40-day history with this approach.
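
For example, querying several of the daily indices in one request is just
a matter of listing them (comma-separated) in the URL:

curl -XGET 'http://localhost:9200/docs-20130227,docs-20130228/_search' -d '{
  "query": { "match_all": {} }
}'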

Cheers,
Dan

Dan Fairs | dan.fairs@gmail.com | @danfairs | secondsync.com

Yes, we issue these commands on our live system, once an hour after a
batch index. Other than _optimize being somewhat flaky (i.e., reporting
success when the switch to the merged segments did not actually happen),
all has been well.

We have not tried store level throttling. We do have mmapfs enabled,
however.
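
If it helps, the mmapfs store is selected per index, e.g. when the index
is created (setting name as I recall it for this version; check the index
store module docs to be sure):

curl -XPUT 'http://localhost:9200/docs-20130228' -d '{
  "settings": { "index.store.type": "mmapfs" }
}'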

So, yes, there are a lot of knobs and switches you can throw.
