Suggestions on how to tweak search accuracy


(Shane Witbeck) #1

I'd like to know what other people do to efficiently tweak searching (via
boost, etc.) to get more accurate search results. I'm working to move from
using Google Search Appliance to ES and the search results are wildly
different.


(Shane Witbeck) #2

I should also mention I'm using the Java API and ES version 0.18.5. I've
done some initial testing with changing boost values for a couple of fields
but the values don't seem to have an impact. Do I need to also change boost
values at index time for this to work? Here's an example of what I have:

boolQuery.must(
    queryString(terms)
        .analyzer("snowball")
        .field("message", 1)
        .field("subject", 3)
);

Thanks.
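
For readers following along, it can help to write out the search body that a snippet like the one above corresponds to. Below is a minimal sketch in Python, assuming the standard query_string `fields` syntax in which a `^N` suffix sets a per-field boost. The helper name `boosted_query_string` and the sample terms are illustrative only. Boosts given this way are applied at query time, so they do not require reindexing.

```python
import json


def boosted_query_string(terms, boosts, analyzer="snowball"):
    """Build a query_string search body with per-field, query-time boosts.

    `boosts` maps field name -> boost. A boost of 1 is the default, so it
    is written without the ^ suffix.
    """
    fields = [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in boosts.items()
    ]
    return {
        "query": {
            "query_string": {
                "query": terms,
                "fields": fields,
                "analyzer": analyzer,
            }
        }
    }


body = boosted_query_string("jboss ssl install", {"message": 1, "subject": 3})
print(json.dumps(body, indent=2))
```

Printing the body this way makes it easy to paste into a curl recreation and confirm which boosts are actually being sent.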


(Clinton Gormley) #3

On Fri, 2012-01-06 at 14:37 -0800, Shane Witbeck wrote:

I should also mention I'm using the Java API and ES version 0.18.5.
I've done some initial testing with changing boost values for a couple
of fields but the values don't seem to have an impact. Do I need to
also change boost values at index time for this to work? Here's an
example of what I have:

boolQuery.must(
    queryString(terms)
        .analyzer("snowball")
        .field("message", 1)
        .field("subject", 3)
);

Hi Shane,

You'll need to provide more details. Currently we don't have much more
info than "it doesn't do what I expect".

Put together a simple example using curl (see
http://www.elasticsearch.org/help ) showing what you are doing, what
results you get, and how they differ from what you expect, and
we'll be able to give you much more useful answers.

clint


(Shane Witbeck) #4

Clinton, thanks for the reply. Apologies for not giving more information at
the start. Here's more information:

The gist contains cluster health info, sample query, and meta including
mapping for the "threads" index that I'm querying. Please let me know what
else you may need.

My goal is to start applying tweaks to the index/search queries so that
they match better with our existing search solution. As I mentioned before,
I tried giving the "subject" field a higher boost value (3) in the search
query and leaving the "message" boost value at (1) for the query "jboss ssl
install". My desire is to boost documents that contain the query terms in
the "subject" field. The "subject" field is typically much shorter in
length and the "message" field can be very large. The context here is that
this is searching forums where the "subject" field is the topic of the
original forum post and the "message" field is an aggregate of the original
forum body and all the reply posts.

Another goal is to return fewer documents in total. ES seems to return
many more results (by orders of magnitude) for the same queries.
Is there a minimum score threshold setting to accomplish this?

I'm also using the snowball analyzer here, which I'm not familiar with, so
any pointers to in-depth documentation on it would be welcome too.

Thanks,
Shane
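
On the result-count question, two knobs commonly used to shrink a result set are requiring every term to match (the query_string `default_operator` set to AND; the default is OR) and a top-level `min_score` cutoff on the search request. The sketch below assumes the ES version in use supports `min_score`. The helper name and the 0.5 threshold are placeholders, and since scores are not normalized across queries, any fixed cutoff needs tuning against real data.

```python
import json


def narrowed_search(terms, fields, min_score=None):
    """Build a search body that matches fewer documents.

    default_operator=AND requires all query terms to be present;
    min_score (optional) drops hits scoring below the threshold.
    """
    body = {
        "query": {
            "query_string": {
                "query": terms,
                "fields": fields,
                "default_operator": "AND",
            }
        }
    }
    if min_score is not None:
        body["min_score"] = min_score
    return body


# 0.5 is an arbitrary illustrative threshold, not a recommended value.
print(json.dumps(
    narrowed_search("jboss ssl install", ["message", "subject^3"], 0.5),
    indent=2,
))
```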


(Shane Witbeck) #5

After doing some more testing, I found that removing the sort dramatically
improves search accuracy. What's the trick to keeping that accuracy AND
applying a default sort? Here's the updated query without sort:

Thanks,
Shane


(Clinton Gormley) #6

Hi Shane

My goal is to start applying tweaks to the index/search queries so
that they match better with our existing search solution. As I
mentioned before, I tried giving the "subject" field a higher boost
value (3) in the search query and leaving the "message" boost value at
(1) for the query "jboss ssl install". My desire is to boost documents
that contain the query terms in the "subject" field. The "subject"
field is typically much shorter in length and the "message" field can
be very large. The context here is that this is searching forums where
the "subject" field is the topic of the original forum post and the
"message" field is an aggregate of the original forum body and all the
reply posts.

OK - a few things here:

  1. in your gist, you're actually boosting message and subject equally
    (i.e. both at 1.0)

  2. you're then throwing away the score (relevance) and sorting by
    lastPostDate instead

  3. your query is over-complicated by using and|or AND a bool query

Here is an example of your query that should produce good results:

And here is the same query but tweaked to boost (increase the relevancy
of) posts made since 2011-01-01:

clint
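
The gists referred to in this post are not reproduced in the archive. As a rough idea only, and not necessarily what Clinton posted, one common way to favor recent posts is a bool query in which the text match sits under `must` and a date range sits under `should`: documents in the range score higher, but older ones are not excluded. The field name lastPostDate comes from the thread; the `gte` range syntax, the subject boost of 3, and the helper name are assumptions for illustration.

```python
import json


def recent_boost_query(terms, since, date_field="lastPostDate"):
    """Require a text match; optionally reward posts dated on/after `since`."""
    return {
        "query": {
            "bool": {
                # Every hit must match the text query.
                "must": [
                    {
                        "query_string": {
                            "query": terms,
                            "fields": ["message", "subject^3"],
                        }
                    }
                ],
                # Optional clause: matching it adds to the score only.
                "should": [
                    {"range": {date_field: {"gte": since}}}
                ],
            }
        }
    }


print(json.dumps(recent_boost_query("jboss ssl install", "2011-01-01"), indent=2))
```

No `sort` is set here, so hits come back ordered by relevance.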


(Shane Witbeck) #7

Regarding #2, there's really no way to accomplish the more accurate
(without sort) search THEN sort? If so, I'm thinking the thing to do here
is NOT provide the default sort on the initial query so more accurate
results are produced, then allow users to sort if they choose to. There are
many other sorting fields I'm providing on the UI besides the date.

Thanks for your very helpful guidance and reworked queries!


(Clinton Gormley) #8

On Mon, 2012-01-09 at 08:59 -0800, Shane Witbeck wrote:

Regarding #2, there's really no way to accomplish the more accurate
(without sort) search THEN sort? If so, I'm thinking the thing to do
here is NOT provide the default sort on the initial query so more
accurate results are produced, then allow users to sort if they choose
to. There are many other sorting fields I'm providing on the UI
besides the date.

If you don't provide a sort parameter, then it defaults to sorting by
relevance, highest score first:

sort: [{ "_score": "desc" }]

So you can combine that with other sort parameters if you like.

clint
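
To make the idea of combining sort keys concrete, here is a small sketch of a search body that ranks by relevance first (highest `_score` first) and breaks ties by date. `sort` takes a list of keys that are applied in order. lastPostDate is the field from this thread; the helper name `search_with_sort` is hypothetical.

```python
import json


def search_with_sort(query, sort_keys=None):
    """Attach an ordered list of (field, order) sort keys to a query.

    With no sort keys the body has no "sort" entry, which leaves the
    default relevance ordering in place.
    """
    body = {"query": query}
    if sort_keys:
        body["sort"] = [{field: order} for field, order in sort_keys]
    return body


query = {"query_string": {"query": "jboss ssl install",
                          "fields": ["message", "subject^3"]}}

# Relevance first, newest post as the tie-breaker.
by_relevance = search_with_sort(
    query, [("_score", "desc"), ("lastPostDate", "desc")]
)
print(json.dumps(by_relevance, indent=2))
```

A UI could pass a different key list when the user picks an explicit sort, and omit it for the initial relevance-ranked search.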

