Phrase suggester and non-existing terms


(Christoph Haas-2) #1

Dear list,

I'm about to implement a full-text search for a web site and am a bit
stuck on a "did you mean"-like feature. If no results are found, I
would like to point the visitor to more promising search queries. Here
is what I'm doing:

curl -s -XPOST 'torf:9200/debshots/jdbc/_search?search_type=count' -d '{
  "suggest" : {
    "text" : "editar zaphodbeeblebrox",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "simple",
        "field" : "description",
        "size" : 4,
        "real_word_error_likelihood" : 0.95,
        "confidence" : 2.0,
        "gram_size" : 1,
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        },
        "direct_generator" : [{
          "field" : "description",
          "suggest_mode" : "missing",
          "min_doc_freq" : 1,
          "min_word_len" : 1
        }]
      }
    }
  }
}' | json_pp

I would expect to get "editor" back because "editar" is a typo and
"zaphodbeeblebrox" does not exist anywhere in my index. But what I get
instead is:

{
   "hits" : {
      "hits" : [],
      "max_score" : 0,
      "total" : 43682
   },
   "timed_out" : false,
   "suggest" : {
      "simple_phrase" : [
         {
            "length" : 23,
            "options" : [
               {
                  "text" : "editor zaphodbeeblebrox",
                  "score" : 0.00022499196,
                  "highlighted" : "<em>editor</em> zaphodbeeblebrox"
               },
               {
                  "text" : "edit zaphodbeeblebrox",
                  "score" : 6.074264e-05,
                  "highlighted" : "<em>edit</em> zaphodbeeblebrox"
               },
               {
                  "text" : "editors zaphodbeeblebrox",
                  "score" : 4.8506903e-05,
                  "highlighted" : "<em>editors</em> zaphodbeeblebrox"
               }
            ],
            "text" : "editar zaphodbeeblebrox",
            "offset" : 0
         }
      ]
   },
   "_shards" : {
      "failed" : 0,
      "successful" : 1,
      "total" : 1
   },
   "took" : 16
}

The problem is that if a user then searched for "editors
zaphodbeeblebrox" as suggested, they would find no documents. That will annoy users.

So my question is: how can I change the search / phrase-suggest query to
remove words that are not found anyway? I was tempted to take the
highlighted suggestions and just keep the words between the <em>…</em>
tags, but that feels dirty and I'm sure there is a better way.
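
The idea of keeping only the words between the highlight tags might look roughly like the Python sketch below. It assumes the highlight tags are `<em>` and `</em>`, and the helper name `changed_terms` is made up for illustration. It only returns the terms the suggester changed, so a term that was left unhighlighted because it was already correct would be dropped as well.

```python
import re

# Hypothetical helper; the default tags are an assumption about the
# pre_tag/post_tag values configured in the suggest request.
def changed_terms(highlighted, pre_tag="<em>", post_tag="</em>"):
    """Return the terms that the suggester wrapped in highlight tags."""
    pattern = re.escape(pre_tag) + r"(.*?)" + re.escape(post_tag)
    return re.findall(pattern, highlighted)

print(changed_terms("<em>editor</em> zaphodbeeblebrox"))
```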

Thanks in advance for your help.

…Christoph

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/da5dc2c0-9e4a-4563-b78a-129660073197%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(simonw-2) #2

I think the only way to do this is to raise the confidence level up. Did
you try this?

simon

On Friday, November 29, 2013 12:23:20 AM UTC+1, Christoph Haas wrote:

> So my question is: how can I change the search / phrase-suggest query to
> remove words that are not found anyway? I was tempted to take the
> highlighted suggestions and just keep the words between the <em>…</em>
> tags but that feels dirty and I'm sure there is a better way.



(Christoph Haas) #3

Thanks for your answer. Yes, I tried that. The higher I set the
confidence the fewer valid suggestions for existing terms come up. But
the term that does not remotely match any term in the index stays.

For example with a confidence of 0.1 I get:

  • editor zaphodbeeblebrox
  • edit zaphodbeeblebrox
  • editing zaphodbeeblebrox

At a confidence level of 1 I get:

  • editor zaphodbeeblebrox

And at higher levels I just get no results at all.
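
For reference, a sweep over confidence values like the one described here could be scripted roughly as follows. This is a Python sketch that only builds the request bodies; the function name `phrase_suggest_body` is made up for illustration, and the field, analyzer and generator settings are copied from the original request.

```python
import json

def phrase_suggest_body(text, confidence):
    """Build a phrase-suggest request body for one confidence value."""
    return {
        "suggest": {
            "text": text,
            "simple_phrase": {
                "phrase": {
                    "analyzer": "simple",
                    "field": "description",
                    "size": 4,
                    "real_word_error_likelihood": 0.95,
                    "confidence": confidence,
                    "gram_size": 1,
                    "direct_generator": [{
                        "field": "description",
                        "suggest_mode": "missing",
                        "min_doc_freq": 1,
                        "min_word_len": 1,
                    }],
                }
            },
        }
    }

for confidence in (0.1, 1.0, 2.0):
    body = phrase_suggest_body("editar zaphodbeeblebrox", confidence)
    # Each printed string can be passed to curl -d or any HTTP client.
    print(json.dumps(body))
```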

But what I would like to get is:

  • editor

In my opinion a "did you mean" query would not make sense if it left
search words in the query that will lead to zero documents when I let
the user search for them.

…Christoph

On 29.11.2013 22:20, simonw wrote:

> I think the only way to do this is to raise the confidence level up.
> Did you try this?
>
> simon

--
A distributed system is one in which I cannot get something done
because a machine I've never heard of is down. (Leslie Lamport)



(Nik Everett) #4

On Sat, Nov 30, 2013 at 4:39 AM, Christoph Haas <email@christoph-haas.de> wrote:

> Thanks for your answer. Yes, I tried that. The higher I set the
> confidence the fewer valid suggestions for existing terms come up. But
> the term that does not remotely match any term in the index stays.
>
> For example with a confidence of 0.1 I get:
>
>   • editor zaphodbeeblebrox
>   • edit zaphodbeeblebrox
>   • editing zaphodbeeblebrox
>
> At a confidence level of 1 I get:
>
>   • editor zaphodbeeblebrox
>
> And at higher levels I just get no results at all.
>
> But what I would like to get is:
>
>   • editor
>
> In my opinion a "did you mean" query would not make sense if it left
> search words in the query that will lead to zero documents when I let
> the user search for them.

I think I understand the problem here. If you are using AND as the default
query operator then it does you no good to get any suggestions back that
contain terms that aren't in any documents. Stemming helps with this, but
only a bit.

It'd be nice if he could get "editor" highlighted as "editor
zaphodbeeblebrox".

I think that is possible but not yet implemented.

One thing to keep in mind is that the suggester never actually makes sure
that the suggestions produce results. It just suggests text that is more
likely to produce results. Elasticsearch has an issue in flight to make
sure of that (https://github.com/elasticsearch/elasticsearch/issues/3482)
but I don't know when we'll get movement on it.

Even with 3482 this feature would be useful as it'd clean up some of the
suggestions that had no chance of passing the filter.
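
Until something like that exists, a client could approximate the check itself by running each suggestion as a query and keeping only those that return hits. A rough Python sketch, where `count_hits` is a hypothetical callable standing in for whatever count or search request the application already uses:

```python
def suggestions_with_hits(options, count_hits):
    """Keep only suggest options whose text matches at least one document.

    options    -- the "options" list from a phrase-suggest response
    count_hits -- hypothetical callable: query text -> number of matching docs
    """
    return [opt for opt in options if count_hits(opt["text"]) > 0]

# Stand-in for a real count request, for illustration only.
fake_counts = {"editor": 12, "editor zaphodbeeblebrox": 0}
options = [{"text": "editor zaphodbeeblebrox"}, {"text": "editor"}]
print(suggestions_with_hits(options, lambda q: fake_counts.get(q, 0)))
```

This costs one extra request per suggestion, so it is only practical for a small `size`.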

So my suggestion is to open an issue. Fix it if you are feeling brave. If
not I'll have a look at it when I get the chance, though I don't know when
that'll be.

Nik



(Christoph Haas) #5

Nik, thanks for your reply…

On 30.11.2013 17:44, Nikolas Everett wrote:

> I think I understand the problem here. If you are using AND as the
> default query operator then it does you no good to get any suggestions
> back that contain terms that aren't in any documents. Stemming helps
> with this, but only a bit.

I see. So the approach would be to use "or" as the default operator so
that it doesn't matter if terms like "zaphodbeeblebrox" would stay in
the query as they would just get ignored.

> It'd be nice if he could get "editor" highlighted as "editor
> zaphodbeeblebrox".

Yes, something like that. But of course that's just a matter of two
lines of script code on my (Ruby) side. I just wondered if there is a
better way.

> I think that is possible but not yet implemented.
>
> One thing to keep in mind is that the suggester never actually makes
> sure that the suggestions produce results. It just suggests text that
> is more likely to produce results. Elasticsearch has an issue in
> flight to make sure of that
> (https://github.com/elasticsearch/elasticsearch/issues/3482) but I
> don't know when we'll get movement on it.

Very good. That sounds interesting. I understand that making sure a
suggestion gives useful results costs more computing resources. But then
again a web site should rather spend a second to give me relevant
results than spend just 5ms and come up with something that is
syntactically correct but won't find me what I'm looking for. I'll be
happy when issue 3482 is resolved one day. Until then I'll just go
with the "or" approach.

> So my suggestion is to open an issue. Fix it if you are feeling brave.
> If not I'll have a look at it when I get the chance, though I don't
> know when that'll be.

I wish I were able to help, but I have no Java skills. And to be
honest I have a very hard time figuring out how to use Elasticsearch
properly. Perhaps in a few weeks I can help by contributing improvements to
the documentation.

Thanks again.

…Christoph



(Nik Everett) #6

On Sat, Nov 30, 2013 at 1:08 PM, Christoph Haas <email@christoph-haas.de> wrote:

> Nik, thanks for your reply…
>
> On 30.11.2013 17:44, Nikolas Everett wrote:
>
>> I think I understand the problem here. If you are using AND as the
>> default query operator then it does you no good to get any suggestions
>> back that contain terms that aren't in any documents. Stemming helps
>> with this, but only a bit.
>
> I see. So the approach would be to use "or" as the default operator so
> that it doesn't matter if terms like "zaphodbeeblebrox" would stay in
> the query as they would just get ignored.

OR vs AND is kind of hard. Lots of people go with OR and let the items
with more results get pulled higher. Others go with AND because that is
what people are used to. I use AND but I think I'm in the minority of
Elasticsearch users.
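
To make the trade-off concrete, here is a sketch of how the operator could be set per query. It assumes a `match` query on the `description` field used earlier in the thread; the helper name `match_query` is illustrative only.

```python
import json

def match_query(text, operator="or"):
    """Build a match query body; operator is "or" or "and"."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    return {
        "query": {
            "match": {
                "description": {"query": text, "operator": operator}
            }
        }
    }

# With "or", a document containing only "editor" still matches;
# with "and", every term must be present in the document.
print(json.dumps(match_query("editor zaphodbeeblebrox", "or")))
```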

>> It'd be nice if he could get "editor" highlighted as "editor
>> zaphodbeeblebrox".
>
> Yes, something like that. But of course that's just a matter of two
> lines of script code on my (Ruby) side. I just wondered if there is a
> better way.

It wouldn't be right to just remove any terms that Elasticsearch doesn't
highlight, because it only highlights the terms that it decided should be
changed. Not highlighting a term just means that Elasticsearch couldn't
find anything better to replace it with. So an unhighlighted term should
either stay in the query because it is already the right term, or be
removed because it isn't in any of the indexes and there isn't a term
close enough to replace it with.
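
One way to tell the two cases apart on the client side is to look up the document frequency of each term in the suggestion. A Python sketch, assuming a hypothetical `doc_freq` callable (for example backed by a per-term count request) and plain whitespace tokenization, which may not match the analyzer used on the field:

```python
def prune_missing_terms(suggestion, doc_freq):
    """Drop terms from a suggestion that occur in no document.

    suggestion -- the plain "text" of a suggest option
    doc_freq   -- hypothetical callable: term -> number of documents containing it
    """
    kept = [term for term in suggestion.split() if doc_freq(term) > 0]
    return " ".join(kept)

# Illustrative frequencies, not real index data.
freqs = {"editor": 12}
print(prune_missing_terms("editor zaphodbeeblebrox", lambda t: freqs.get(t, 0)))
```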

>> I think that is possible but not yet implemented.
>>
>> One thing to keep in mind is that the suggester never actually makes
>> sure that the suggestions produce results. It just suggests text that
>> is more likely to produce results. Elasticsearch has an issue in
>> flight to make sure of that
>> (https://github.com/elasticsearch/elasticsearch/issues/3482) but I
>> don't know when we'll get movement on it.
>
> Very good. That sounds interesting. I understand that making sure a
> suggestion gives useful results costs more computing resources. But then
> again a web site should rather spend a second to give me relevant
> results than spend just 5ms and come up with something that is
> syntactically correct but won't find me what I'm looking for. I'll be
> happy when issue 3482 is resolved one day. Until then I'll just go
> with the "or" approach.
>
>> So my suggestion is to open an issue. Fix it if you are feeling brave.
>> If not I'll have a look at it when I get the chance, though I don't
>> know when that'll be.
>
> I wish I were able to help, but I have no Java skills. And to be
> honest I have a very hard time figuring out how to use Elasticsearch
> properly. Perhaps in a few weeks I can help by contributing improvements to
> the documentation.

Cool. If you open an issue it'll at least end up on someone's radar.


