Yes. If you enable term vectors on that field, highlighting it will be
faster (you will need to reindex the data for that). Also, if you store the
field, there is no need to load it from _source and parse it. I suggest
creating a simple one-node cluster with several of those big documents and
testing it separately.
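As a rough sketch of what such a mapping could look like, the snippet below
builds the JSON body in Python. It assumes the string-field options of this
era ("term_vector": "with_positions_offsets" and "store": "yes"); the type
name "doc" and field name "full_text" are placeholders, not names from this
thread. The mapping has to be in place before the documents are indexed,
hence the reindex.

```python
import json


def highlight_friendly_mapping(doc_type, field):
    """Build a mapping body that keeps term vectors and stores the field."""
    return {
        doc_type: {
            "properties": {
                field: {
                    "type": "string",
                    # Positions and offsets let the highlighter avoid
                    # re-analyzing the whole field at query time.
                    "term_vector": "with_positions_offsets",
                    # Storing the field avoids loading and parsing _source.
                    "store": "yes",
                }
            }
        }
    }


# "doc" and "full_text" are hypothetical names used only for illustration.
mapping = highlight_friendly_mapping("doc", "full_text")
print(json.dumps(mapping, indent=2))
```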
On Thursday, June 16, 2011 at 7:35 PM, Eric Mill wrote:
On Thu, Jun 16, 2011 at 7:11 AM, Shay Banon shay.banon@elasticsearch.com wrote:
- How do you do highlighting? Are term vectors stored for the field you
highlight on? Is the field itself also stored?
It's all the defaults; I haven't specified a mapping file. The _mapping
endpoint simply says {type: "string"} for the field in question.
Would term_vector highlighting address the problem? I don't mind a larger
index size. Is there any other factor I should keep in mind when deciding
whether to use term vectors?
I'm new to the world of Lucene and document searching, so it's likely I'm
asking naive questions here.
- How big are the document and the field that you are trying to highlight?
In the mapping being searched, there's at least one matching document that
is 15MB. The field being searched takes up nearly all that space. There may
be other matches, but that's the only humongous one.
Thanks again for your help.
-- Eric
On Thursday, June 16, 2011 at 2:47 AM, Eric Mill wrote:
OK, I've got a lot more info about the problem. I've now isolated it to a
query, not a machine. EC2 isn't the problem, as I first thought it was. I
can trigger Elasticsearch to go into a death spiral (where it spikes the CPU
and stays there until kill -9'd) the first time this query is run on this
set of documents.
The query is a dis_max on a set of 6 "text" queries of type "phrase", each
for the term "cap and trade". Highlighting is enabled on this query, and
when I disable highlighting, the query works rapidly and doesn't kill
Elasticsearch.
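For reference, the shape of the request is roughly the following, sketched
in Python. The six field names are placeholders (the real ones are not
listed here), and the body assumes the "text" query with "type": "phrase"
as described above. The second helper drops the highlight section, which is
the variant that runs quickly.

```python
import json


def phrase_dis_max(phrase, fields):
    """Build a dis_max of phrase-type text queries, with highlighting."""
    queries = [
        {"text": {field: {"query": phrase, "type": "phrase"}}}
        for field in fields
    ]
    return {
        "query": {"dis_max": {"queries": queries}},
        # Request highlighting on every field that is searched.
        "highlight": {"fields": {field: {} for field in fields}},
    }


def without_highlight(body):
    """Return a copy of the request body with highlighting removed."""
    return {key: value for key, value in body.items() if key != "highlight"}


# Hypothetical field names, six of them to mirror the query described above.
FIELDS = ["field_1", "field_2", "field_3", "field_4", "field_5", "field_6"]

body = phrase_dis_max("cap and trade", FIELDS)
print(json.dumps(body, indent=2))
print(json.dumps(without_highlight(body), indent=2))
```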
It also only kills Elasticsearch when some particular document or documents
are present. I've tried cutting out large swathes of documents, and this
makes the query work okay. On the full set, the query reliably kills
Elasticsearch.
The "and" is causing the query to become more complex, because despite the
fact that it's a phrase query, I think the "and" is getting treated as a
boolean operator. I tried "health and care", for example, and this sent
Elasticsearch into a mini-spiral that it managed to recover from, but
returned me results with "health care" as a phrase present, but where each
word was highlighted separately. Possibly it's just getting removed as a
stop word, but I would think that the highlighting would apply to the phrase
as a whole, in that case.
So my questions now are:
- Is this enough information to hazard any guesses about why Elasticsearch
  might spiral out of control?
- How can I make an exact phrase query on a text field but not have "and"
  considered a boolean operator?
-- Eric
On Tue, Jun 14, 2011 at 9:31 PM, Eric Mill kprojection@gmail.com wrote:
It's an m1.large, which has high I/O. I'm using the latest version,
elasticsearch v0.16.2.
I'll try tomorrow morning to see if preventing it from swapping fixes it.
-- Eric
On Tue, Jun 14, 2011 at 7:04 PM, Shay Banon shay.banon@elasticsearch.com wrote:
A few more questions:
- Which EC2 instance type are you using? The smaller the instance, the more
likely it is to be hurt by "virtualization noise". Also, make sure to use
one with high IO.
- Which elasticsearch version are you using?
I am not aware of problems on 10.10. I have heard bad stories about it on
EC2 as well (from people using Redis), but have not encountered them with
Elasticsearch.
On Wednesday, June 15, 2011 at 1:59 AM, Eric Mill wrote:
- I'll try that.
- It's Ubuntu 10.10. Still known problems?
- No, indexing is done in bulk once a night (for now).
- The data directory is on an EBS, but it doesn't appear to be
misbehaving. I'll investigate a bit more on that front.
Thanks, Shay,
-- Eric
On Tue, Jun 14, 2011 at 4:25 PM, Shay Banon shay.banon@elasticsearch.com wrote:
Some typical things to check:
- Make sure the ES (java) process is not swapping. You can force it not to
swap by setting bootstrap.mlockall to true.
- Which operating system are you running? Ubuntu 10.04 is known to have
problems on ec2.
- Are you indexing while you do the search?
- Where do you store the data directory? Is it on local drive or EBS?
Maybe EBS is misbehaving?
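One way to check the first point on Linux is to read the VmSwap line from
/proc/<pid>/status for the java process. The sketch below is only an
illustration: it assumes a kernel that reports VmSwap, the function names
are made up for this example, and the sample text is invented rather than
taken from a real machine.

```python
import re


def parse_vm_swap_kb(status_text):
    """Return the VmSwap value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def swapped_kb(pid):
    """Read how much of the given process is currently swapped out (Linux)."""
    with open("/proc/%d/status" % pid) as handle:
        return parse_vm_swap_kb(handle.read())


# Invented sample of a status file, just to show the parsing.
sample = "Name:\tjava\nVmRSS:\t  1048576 kB\nVmSwap:\t    2048 kB\n"
print(parse_vm_swap_kb(sample))
```

A non-zero value for the ES process suggests it is swapping, in which case
setting bootstrap.mlockall to true is worth trying.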
On Tuesday, June 14, 2011 at 10:28 PM, Eric Mill wrote:
Any ideas of why a query might work fine, and then suddenly spike out of
control? I have the logs, running at DEBUG level, from an example incident.
The same query gets executed, and on the 10th or so run, the memory starts
piling up, and top shows the process taking anywhere from 100 to 177% of the
CPU while this is happening.
ElasticSearch running out of control · GitHub
I can get it to occur pretty easily after I restart the server, but only on
my EC2 instance. On my local machine, it's fine.
The query is a dis_max of 6 text/phrase queries on different fields, on an
index of about 17,000 documents, where each document ranges from a few
kilobytes to a few megabytes (not many of those). The data directory takes
up 2.4GB of space.
Sorry for all the emails - I guess today's my day for questions!
-- Eric