CPU and memory usage suddenly spiral out of control


(Eric Mill) #1

Any ideas why a query might work fine and then suddenly spike out of
control? I have the logs, running at DEBUG level, from an example incident.
The same query gets executed, and on the 10th or so run, the memory starts
piling up, and top shows the process taking anywhere from 100 to 177% of the
CPU while this is happening.

https://gist.github.com/1025642

I can get it to occur pretty easily after I restart the server, but only on
my EC2 instance. On my local machine, it's fine.

The query is a dis_max of 6 text/phrase queries on different fields, on an
index of about 17,000 documents, where each document ranges from a few
kilobytes to a few megabytes (not many of those). The data directory takes
up 2.4GB of space.

Sorry for all the emails - I guess today's my day for questions!

-- Eric


(Shay Banon) #2

Some typical things to check:

  1. Make sure the ES (Java) process is not swapping. You can force it not to swap by setting bootstrap.mlockall to true.
  2. Which operating system are you running? Ubuntu 10.04 is known to have problems on EC2.
  3. Are you indexing while you do the search?
  4. Where do you store the data directory? Is it on a local drive or on EBS? Maybe EBS is misbehaving?
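
As a rough way to check the first point, here is a minimal sketch that reads how much of a process has been swapped out. It assumes a Linux kernel recent enough to expose a VmSwap line in /proc/<pid>/status; the function names and the sample text are made up for illustration.

```python
import re

def swapped_kb(status_text):
    """Return the VmSwap value (in kB) from /proc/<pid>/status text, or None if absent."""
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def read_status(pid):
    """Read the status file of a running process (Linux only)."""
    with open("/proc/%d/status" % pid) as handle:
        return handle.read()

# Hypothetical sample of what the status file might contain for the java process.
sample = "Name:\tjava\nVmRSS:\t 1048576 kB\nVmSwap:\t    2048 kB\n"
print(swapped_kb(sample))
```

On a live machine, swapped_kb(read_status(pid)) with the pid of the ES java process gives the current figure; a value that keeps growing while the query runs would point towards swapping.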


(Eric Mill) #3

  1. I'll try that.
  2. It's Ubuntu 10.10. Still known problems?
  3. No, indexing is done in bulk once a night (for now).
  4. The data directory is on an EBS volume, but it doesn't appear to be
    misbehaving. I'll investigate a bit more on that front.

Thanks, Shay,

-- Eric


(Shay Banon) #4

A few more questions:

  1. Which EC2 instance type are you using? The smaller the instance, the more likely it is to be hurt by "virtualization noise". Also, make sure to use one with high I/O.
  2. Which elasticsearch version are you using?

I am not aware of specific problems with 10.10, though I have heard bad stories about it on EC2 as well (from people using redis). I have not encountered them with elasticsearch.


(Eric Mill) #5

It's an m1.large, which has high I/O. I'm using the latest version,
elasticsearch v0.16.2.

I'll try tomorrow morning to see if preventing it from swapping fixes it.

-- Eric


(Eric Mill) #6

OK, I've got a lot more info about the problem. I've now isolated it to a
query, not a machine. EC2 isn't the problem, as I first thought it was. I
can trigger ElasticSearch to go into a death spiral (where it spikes the CPU
and stays there until kill -9'd) the first time this query is run on this
set of documents.

The query is a dis_max on a set of 6 "text" queries of type "phrase", each
for the term "cap and trade". Highlighting is enabled on this query, and
when I disable highlighting, the query works rapidly and doesn't kill
ElasticSearch.
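
For reference, the request described above could be built roughly like this. It is only a sketch: the field names are hypothetical stand-ins for the six real fields, and the body shape assumes the "text" query with "type": "phrase" and the "highlight" section as they existed in elasticsearch around 0.16.

```python
import json

def phrase_search_body(phrase, fields, highlight=True):
    """Build a dis_max of one phrase-type text query per field, optionally highlighted."""
    queries = [
        {"text": {field: {"query": phrase, "type": "phrase"}}}
        for field in fields
    ]
    body = {"query": {"dis_max": {"queries": queries}}}
    if highlight:
        # Ask for highlighted fragments on every searched field.
        body["highlight"] = {"fields": {field: {} for field in fields}}
    return body

# Hypothetical field names, for illustration only.
fields = ["title", "summary", "full_text"]
print(json.dumps(phrase_search_body("cap and trade", fields), indent=2))
```

Passing highlight=False gives the variant that runs quickly, which makes it easy to compare the two requests side by side.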

It also only kills ElasticSearch when some particular document or documents
are present. I've tried cutting out large swathes of documents, and this
makes the query work okay. On the full set, the query reliably kills
ElasticSearch.

The "and" is causing the query to become more complex, because despite the
fact that it's a phrase query, I think the "and" is getting treated as a
boolean operator. I tried "health and care", for example, and this sent
ElasticSearch into a mini-spiral that it managed to recover from, but
returned me results with "health care" as a phrase present, but where each
word was highlighted separately. Possibly it's just getting removed as a
stop word, but I would think that the highlighting would apply to the phrase
as a whole, in that case.

So my questions now are:

  • Is this enough information to hazard any guesses about why ElasticSearch
    might spiral out of control?

  • How can I make an exact phrase query on a text field but not have "and"
    considered a boolean operator?

-- Eric


(Shay Banon) #7

  • The text query does not have boolean operators in it. It simply takes the text, analyzes it, and uses the resulting terms. "and" is a stopword, that's why you don't see it.
  • How do you do highlighting? Are term vectors stored for the field you highlight on? Is the field itself also stored?
  • How big are the document and field that you try to highlight on?
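
To illustrate the stopword point, here is a toy stand-in for the analysis step. It is not the real analyzer: the stop list is a small made-up subset and the tokenizer is just a whitespace split, but it shows why "and" never reaches the query.

```python
# Small illustrative subset of English stop words (an assumption, not the real list).
STOP_WORDS = {"a", "an", "and", "the", "of", "or", "to", "in"}

def analyze(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(analyze("cap and trade"))    # ['cap', 'trade']
print(analyze("health and care"))  # ['health', 'care']
```

With "and" dropped, only the remaining terms are searched for, which is consistent with "health" and "care" being highlighted as separate words.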


(Eric Mill) #8

On Thu, Jun 16, 2011 at 7:11 AM, Shay Banon shay.banon@elasticsearch.com wrote:

  • how do you do highlighting? Are term vectors stored for the field you
    highlight on? Is it also stored?

It's all the defaults; I haven't specified a mapping file. The _mapping
endpoint simply says {type: "string"} for the field in question.

Would term_vector highlighting address the problem? I don't mind a larger
index size. Is there any other factor I should keep in mind when deciding
whether to use term vectors?

I'm new to the world of Lucene and document searching, so it's likely I'm
asking naive questions here.

  • How big are the document and field that you try and highlight on?

In the mapping being searched, there's at least one matching document that
is 15MB. The field being searched takes up nearly all that space. There may
be other matches, but that's the only humongous one.
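
A quick way to spot such outliers before indexing might look like the sketch below. It assumes the documents are available as plain dictionaries keyed by id; the function name, the field name and the sample data are hypothetical.

```python
def largest_fields(docs, field, top=3):
    """Return (size_in_bytes, doc_id) pairs for the biggest values of one field."""
    sizes = [
        (len(doc.get(field, "").encode("utf-8")), doc_id)
        for doc_id, doc in docs.items()
    ]
    return sorted(sizes, reverse=True)[:top]

# Hypothetical sample data: one document is far larger than the rest.
docs = {
    "doc-1": {"full_text": "short text"},
    "doc-2": {"full_text": "x" * 5000},
    "doc-3": {"full_text": "a bit longer text here"},
}
print(largest_fields(docs, "full_text", top=2))
```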

Thanks again for your help.

-- Eric


(Shay Banon) #9

Yes, if you enable term vector highlighting, it will be faster to highlight that field (you will need to reindex the data for that). Also, if you store that field, there won't be a need to load it from _source and parse it. I suggest creating a simple one-node cluster with several of those big documents and testing it separately.
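
A mapping along these lines might look like the sketch below. The type and field names are hypothetical, and the option values ("store": "yes", "term_vector": "with_positions_offsets") assume the string mapping syntax of elasticsearch versions from around this time; the mapping has to be in place before the data is reindexed.

```python
import json

def highlight_friendly_mapping(type_name, field_name):
    """Build a mapping that stores the field and its term vectors with positions and offsets."""
    return {
        type_name: {
            "properties": {
                field_name: {
                    "type": "string",
                    "store": "yes",  # keep the field value so it need not be parsed out of _source
                    "term_vector": "with_positions_offsets",  # needed for term vector highlighting
                }
            }
        }
    }

# Hypothetical type and field names, for illustration only.
print(json.dumps(highlight_friendly_mapping("document", "full_text"), indent=2))
```

The trade-off is a larger index, since both the stored field and the term vectors take extra space on disk.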


(Eric Mill) #10

So I did both of these (stored the field on the index, and enabled term
vector highlighting, then reindexed), and for most queries this speeds up
highlighting and puts the server under less load. The query that used to
kill ElasticSearch ("cap and trade") is now quite fast.

There are some queries that were fine before that now exhibit higher
strain. Since then I've sent ES into a death spiral only once, and not
reproducibly, with one of them (a one-word query, "health"). This is probably
fine for now, but is going to be a problem in the long run.

I know I haven't given you a lot of detail about my current setup, but can
you think of any other general optimizations I could perform besides
enabling term vector highlighting and storing the fields being searched on
the index?

-- Eric

