Understanding tokenization for auto-complete


(Petar Djekic) #1

I'm using ES's auto-completion and I'd like to understand how the prefix
tokenization works. Example queries and the results it's currently returning:

  1. 'blackb' -> 'Blackberry Q10 Red'
    --> Expected

  2. 'q' -> 'Blackberry Q10 Red'
    --> Expected

  3. 'q10' -> No result, expected is 'Blackberry Q10 Red'
    --> Why are results returned when typing in 'q' but not 'q10'?

  4. 'blackberry q10'
    --> Expected

  5. 'sam' -> 'Samsung Galaxy S5'
    --> Expected

  6. 'galax' -> 'Samsung Galaxy S5'
    --> Expected

  7. 'S5'
    -> No result, expected is 'Samsung Galaxy S5'

I'm indexing the documents using input: ["blackberry", "Q10 Red"] and input:
["samsung", "galaxy s5"]; please find the mapping and query below. I thought
the standard tokenizer would also tokenize on whitespace and hence give a
result for 'S5'. I also don't understand why 'q' gives results but 'q10'
doesn't. Can I use the prefix tokenizer for such a use case, or would I
need to switch to ngrams completely?

My mapping looks as follows:

     "mappings" : {
       "suggestions" : {
         "_timestamp" : {
           "enabled" : true,
           "path" : "lastTimestamp"
         },
         "properties" : {
           "suggest" : {
             "type" : "completion",
             "index_analyzer" : "standard",
             "search_analyzer" : "simple",
             "payloads" : true,
             "context" : {
               "type" : {
                 "type" : "category",
                 "path" : "entity"
               }
             }
           }
         }
       }
     }

and the query looks like this:

{
    "suggestions" : {
        "text" : "<query>",
        "completion" : {
            "size" : 5,
            "field" : "suggest",
            "fuzzy" : {
                "fuzziness" : 1
            },
            "context" : {
                "type" : "<internalcategory>"
            }
        }
    }
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4a917898-8525-4916-807e-cb72001b30c5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Alexander Reelsen) #2

Hey,

Two things. First, the completion suggester uses the simple analyzer by
default. Please run the analyze API on some of your terms and you will
see why they don't match anymore (especially 'q10'). You may want to try
out the inquisitor plugin for that; it has a nice web UI for the analyze
API.
Second, the completion suggester is a prefix suggester, so "galaxy S5"
requires you to type "galaxy" first before typing "S5".
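
To make both points a bit more concrete, here is a rough Python sketch. It
is not Elasticsearch code: simple_tokens and standard_tokens are
hypothetical helpers that only approximate the simple analyzer (letters
only, lowercased) and the standard analyzer (letters and digits kept
together, lowercased), and prefix_match is an assumed simplification of
"the typed text must be a prefix of the whole input". It does not model
the suggester's internals or the fuzzy option, so the analyze API remains
the authoritative check.

```python
import re


def simple_tokens(text):
    # Assumed approximation of the `simple` analyzer:
    # split on anything that is not a letter, then lowercase.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def standard_tokens(text):
    # Rough stand-in for the `standard` analyzer:
    # keep runs of letters and digits together, then lowercase.
    return [t.lower() for t in re.findall(r"[^\W_]+", text)]


def prefix_match(query, suggestion_input, analyze=standard_tokens):
    # Hypothetical helper: the analyzed query has to be a prefix of the
    # whole analyzed input, not of an arbitrary word inside it.
    analyzed_input = " ".join(analyze(suggestion_input))
    analyzed_query = " ".join(analyze(query))
    return analyzed_input.startswith(analyzed_query)


if __name__ == "__main__":
    # Digits are dropped by the simple approximation, kept by the standard one.
    for text in ("Q10 Red", "galaxy s5"):
        print(text, "simple:", simple_tokens(text), "standard:", standard_tokens(text))

    # Only queries that start at the beginning of the input match.
    for query in ("galax", "galaxy s5", "s5"):
        print(query, "->", prefix_match(query, "galaxy s5"))
```

In this sketch "Q10 Red" becomes ["q", "red"] under the simple
approximation and ["q10", "red"] under the standard one, and "s5" is not a
prefix of "galaxy s5", which is why it would have to be indexed as its own
input to be suggested.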

Hope this helps

--Alex

On Thu, Jul 31, 2014 at 11:20 AM, Petar Djekic <
petar.djekic@rocket-internet.de> wrote:




