Auto complete


(Vinicius Carvalho) #1

I've searched this list on the subject and found many strategies, and I'm
trying to get one of them to work here.

So, I tried to index my data with a field named auto_complete using a
different analyzer than the default one for the field name:

{

"analysis" : {

 "analyzer" : {
     "auto_complete" : {
         "type" : "custom",
         "tokenizer" : "custom_edgeNGram",
         "filter" : ["lowercase","custom_ngram"]
      }
  },
      
  "tokenizer" : {
  
      "custom_edgeNGram" : {
          "type" : "edgeNGram",
          "min_gram" : 2,
          "max_gram" : 5 
       }
  },
  
  "filter" : {
      "custom_ngram" : {
          "type" : "nGram",
          "min_gram" : 2,
          "max_gram" : 5 
       }
   }

}

}

{
  "artist" : {
    "properties" : {
      "artist_id" : {"type" : "integer", "store" : "no", "index" : "no", "include_in_all" : "false"},
      "name" : {"type" : "string", "store" : "no", "include_in_all" : "true"},
      "rating" : {"type" : "float", "store" : "no", "include_in_all" : "false"},
      "elevation" : {"type" : "float", "store" : "no", "include_in_all" : "false", "index" : "no"},
      "alpha" : {"type" : "integer", "store" : "no", "include_in_all" : "false"},
      "auto_complete" : {"type" : "string", "store" : "no", "include_in_all" : "false", "analyzer" : "auto_complete"}
    }
  }
}

Well, using this new field for auto_complete queries works as long as I'm
searching for only one term (I tried a mix of text and query_string queries).
There was a message on this list about the same problem, but no answer :(

So basically, searching for "Pink" returns Pink, Pink Floyd, Pink
Panther ... Searching for "Pink Fl" should narrow that down, but when I
search on the nGrams I get the same output, and sometimes even "worse"
results.

I'd like to hear what general strategy you guys use for auto_complete.
I had some success using a boolean query with AND operators on each term
and "text_phrase_prefix", but the response time was really high
(something I really would not like for an autocomplete function).

Regards


(Ævar Arnfjörð Bjarmason) #2

The problem is that you have a single analyzer for your
"auto_complete" field. You need two analyzers: one search analyzer
and one index analyzer.

Say you index Pink Floyd with your index analyzer being custom_ngram;
then your index will contain, for document id X:

Pi -> X
Pin -> X
Pink -> X

etc., which is what you want. But since you set both the search and
the index analyzer to custom_ngram, a search for "Pin" is going to
break "Pin" down into:

Pi
Pin

So you'll also search for anything beginning with "Pi", like "Pie".

So change this:

"auto_complete" : {"type" : "string", "store":"no",
"include_in_all" : "false" , "analyzer" : "auto_complete"}

to something like:

"auto_complete" : {"type" : "string", "store":"no",
"include_in_all" : "false" , "index_analyzer" : "auto_complete",
"search_analyzer" : "standard"}

(but perhaps something other than standard, since that'll strip stopwords).
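
The effect described above can be reproduced without Elasticsearch. The following is a minimal plain-Python sketch, not the real analysis chain: the `edge_ngrams`, `build_index` and `search` helpers and the two sample documents are made up for illustration, and the analyzer is simplified to per-word prefixes. It compares analyzing the query with the same n-gram analyzer against leaving the query as a single lowercased term.

```python
from collections import defaultdict


def edge_ngrams(text, min_gram=2, max_gram=5):
    """Hypothetical stand-in for an edge n-gram analyzer: lowercase each
    whitespace-separated word and emit its prefixes of min_gram..max_gram chars."""
    grams = []
    for word in text.lower().split():
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            grams.append(word[:size])
    return grams


def build_index(docs):
    """Inverted index: gram -> set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in edge_ngrams(text):
            index[gram].add(doc_id)
    return index


def search(index, query_terms):
    """OR-search: a document matches if it contains any of the query terms."""
    hits = set()
    for term in query_terms:
        hits |= index.get(term, set())
    return hits


docs = {1: "Pink Floyd", 2: "Pie"}
index = build_index(docs)

# Same analyzer at search time: "Pin" becomes ["pi", "pin"], and "pi" also hits "Pie".
same_analyzer_hits = search(index, edge_ngrams("Pin"))

# Plain search analyzer: "Pin" stays one term, so only "Pink Floyd" matches.
plain_analyzer_hits = search(index, ["pin"])

print(sorted(same_analyzer_hits))   # [1, 2]
print(sorted(plain_analyzer_hits))  # [1]
```

With the query left as one term, the n-grams do their work only on the index side, which is the point of splitting index_analyzer and search_analyzer.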


(Vinicius Carvalho) #3

Thanks, that worked. I had also changed the filter definitions earlier and
got the same result:

{

"analysis" : {

 "analyzer" : {
     "auto_complete" : {
         "type" : "custom",
         "tokenizer" : "whitespace",
         "filter" : ["stop","lowercase","custom_ngram"]
      }
  },
      
  
  "filter" : {
      "custom_ngram" : {
          "type" : "nGram",
          "min_gram" : 2,
          "max_gram" : 5 
       }
   }

}

}

Getting better results now, still not the ones I was expecting. I'll try
using shingle now, nGrams are returning some unexpected stuff as well.
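
One likely source of the "unexpected stuff" is that an nGram-style filter emits every substring of a token, not just its prefixes. The sketch below is plain Python under that assumption (the `ngrams` and `edge_ngrams` helpers are illustrative, not Elasticsearch calls) and uses the min_gram/max_gram values from the settings above.

```python
def ngrams(token, min_gram=2, max_gram=5):
    """All substrings of length min_gram..max_gram, as an nGram-style filter would emit."""
    out = []
    for size in range(min_gram, max_gram + 1):
        for start in range(0, len(token) - size + 1):
            out.append(token[start:start + size])
    return out


def edge_ngrams(token, min_gram=2, max_gram=5):
    """Only the prefixes of length min_gram..max_gram."""
    return [token[:size] for size in range(min_gram, min(max_gram, len(token)) + 1)]


token = "maiden"
all_grams = ngrams(token)
prefix_grams = edge_ngrams(token)

# Interior grams such as "aid" or "den" exist only in the nGram variant,
# so a query term "den" would match "maiden" even though it is not a prefix.
interior_only = sorted(set(all_grams) - set(prefix_grams))

print(prefix_grams)
print(interior_only)
```

If only prefix matches are wanted for autocomplete, the prefix-only variant keeps the index smaller and avoids matches in the middle of words.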

For instance, when searching for "Iron Mai" with AND as the operator, I
would like "Iron Maiden" to rank as more relevant than "A Tribute to Iron
Maiden". I know both contain the phrase I'm looking for, but the exact
match should be boosted.

Regards


(Ævar Arnfjörð Bjarmason) #4

On Wed, Jul 18, 2012 at 11:48 PM, Vinicius Carvalho
viniciusccarvalho@gmail.com wrote:

For instance, searching for "Iron Mai" using AND as operator, I would like
to have the phrase Iron Maiden to have a more relevant result than "A
Tribute to Iron Maiden", I know they both have the same sentence I'm looking
for, but the exact match should be boosted.

Lucene has some pretty sane default behavior, but it's not going to
read your mind. If you just tell it to find both "foo" and "bar"
somewhere in a field, it isn't going to extrapolate from that that
you'd like "foo bar" more than "hello there foo and bar".

You can express all these things by writing a more complex query, e.g.
a bool/should query where you give things that match the user's search
string from the start of the field a higher boost than ones that match
it somewhere within the field.
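
A rough sketch of such a query body, built as a plain Python dict. Several things here are assumptions rather than anything from the thread: `name_raw` is a hypothetical untokenized, lowercased copy of `name` that would have to be added to the mapping; the boost of 5 is arbitrary; and the clause names `match` and `prefix` follow later Elasticsearch releases (the thread's `text` queries are the older spelling).

```python
import json


def autocomplete_query(user_input):
    """Build a bool/should query body for the artist mapping in this thread.

    `auto_complete` is the n-gram field from the mapping; `name_raw` is a
    hypothetical untokenized, lowercased copy of `name` (not in the mapping).
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Strong signal: the whole field starts with what the user typed.
                    # Prefix queries are not analyzed, so lowercase the input here.
                    {"prefix": {"name_raw": {"value": user_input.lower(), "boost": 5}}},
                    # Weak signal: every typed term occurs somewhere in the field.
                    {"match": {"auto_complete": {"query": user_input, "operator": "and"}}},
                ]
            }
        }
    }


body = autocomplete_query("Iron Mai")
print(json.dumps(body, indent=2))
```

Under these assumptions "Iron Maiden" satisfies both clauses while "A Tribute to Iron Maiden" satisfies only the second, so the first should rank higher.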


(David G Ortega) #5

On Wed, Jul 18, 2012 at 11:48 PM, Vinicius Carvalho
viniciusccarvalho@gmail.com wrote:

For instance, searching for "Iron Mai" using AND as operator, I would like
to have the phrase Iron Maiden to have a more relevant result than "A
Tribute to Iron Maiden", I know they both have the same sentence I'm looking
for, but the exact match should be boosted.

In theory, "Iron Maiden" should be the first result of your query if the
documents are "Iron Maiden" and "Tribute to Iron Maiden", simply because the
first one should score higher: both of its two words are query terms, whereas
in the second only two of the four words are.

In other words, the score of the first document should be roughly 2/2 * idf
versus 2/4 * idf for the second.

Are you applying some sorting?
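
The arithmetic behind that estimate, as a small Python sketch. The `matched_fraction` helper is only an illustration of the 2/2 versus 2/4 reasoning, not Lucene's actual formula; `length_norm` shows the 1/sqrt(number of terms) factor that Lucene's classic TF-IDF similarity applies to shorter fields. Both assume whole words as terms, which will not hold exactly for an n-gram field.

```python
import math


def matched_fraction(doc, query_terms):
    """Share of the document's words that are query terms (the 2/2 vs 2/4 estimate)."""
    words = doc.lower().split()
    wanted = {term.lower() for term in query_terms}
    matched = sum(1 for word in words if word in wanted)
    return matched / len(words)


def length_norm(doc):
    """1/sqrt(number of terms): shorter fields get a larger normalization factor."""
    return 1.0 / math.sqrt(len(doc.split()))


query = ["iron", "maiden"]
for doc in ["Iron Maiden", "Tribute to Iron Maiden"]:
    print(doc, matched_fraction(doc, query), round(length_norm(doc), 3))
```

Either way the shorter document comes out ahead, so if it is not ranked first, something else (a sort, or extra n-gram terms in the field) is probably affecting the order.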


(Niranjan) #6

Hello everyone,

I am using ngram for my autocomplete too. My index_analyzer is ngram and my
search_analyzer is standard. Along with this, I also use highlight, so that
the matches are highlighted in a different color in the autocomplete.

I have this problem: I am searching against these 2 sentences, "Play
baseball" and "Play tennis". When I type "P", "Pl", or "Pla", both sentences
match. But when I type "Play", I get 0 matches. If I continue typing to
"Play B", from then on it correctly matches "Play baseball".

Why is this so?

Another problem: I am searching against the word "Trekking". When I
type trekking, it correctly matches the word, but the highlighter only
highlights 'trek' and does not highlight 'ing'. Why is that?

Regards


(Niranjan) #7

To add to this:

I have a sentence "watch a play" in my db.

When I type "watch a play" in my autocomplete input box, it shows this
sentence with only 'watch' being highlighted, when in fact the entire
sentence should be highlighted. Why is this happening? I am completely
stumped.


(Keith Webster) #8

Hey Niranjan,

I believe the reason this is happening for you is because you have set your
max ngram to 5 so the ngram matching stops at 5 characters.
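
A small Python sketch of that cutoff. Niranjan's actual settings were not posted, so the min_gram of 2 and max_gram of 5 here are borrowed from the configuration at the top of the thread, and `indexed_grams` is an illustrative helper, not an Elasticsearch call.

```python
def indexed_grams(token, min_gram=2, max_gram=5):
    """Prefixes of the lowercased token that an edge n-gram index would store."""
    token = token.lower()
    return {token[:size] for size in range(min_gram, min(max_gram, len(token)) + 1)}


grams = indexed_grams("Trekking")

# Nothing longer than max_gram characters is stored, so the full word is absent.
print(sorted(grams, key=len))   # ['tr', 'tre', 'trek', 'trekk']
print("trekk" in grams)         # True
print("trekking" in grams)      # False
```

Any query term longer than max_gram therefore has nothing in the index to match against, unless the search side also truncates or the limit is raised.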


(Iftekharul Haque) #9

Another problem- I have a word I'm searching against "Trekking". When I
type the trekking, it rightly matches the word trekking but highlight only
highlights 'trek', and ignores to highlight 'ing'. Why is this again?

This may happen if you are stemming during search analysis.
