Inverse edge back-Ngram (or making it "fuzzy" at the end of a word)?

Hi

We are discussing building an index where words with possible
misspellings at the end still get hits.

We were looking at using the EdgeNGram token filter to make ngrams of
the last two characters, but that gives us an index of just the
2-character word endings.

How would we best do this? Is it possible to configure the inverse of that?
Should we tokenize it with a regexp? Any other ideas?


On Tue, 2013-02-26 at 02:45 -0800, Per Ekman wrote:

> Hi
>
> We are discussing building an index where words with possible
> misspellings at the end still get hits.
>
> We were looking at using the EdgeNGram token filter to make ngrams of
> the last two characters, but that gives us an index of just the
> 2-character word endings.
>
> How would we best do this? Is it possible to configure the inverse of
> that? Should we tokenize it with a regexp? Any other ideas?

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "end_grams" : {
               "max_gram" : 2,
               "side" : "back",
               "min_gram" : 2,
               "type" : "edge_ngram"
            }
         },
         "analyzer" : {
            "end_grams" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "stop",
                  "end_grams"
               ],
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=The+quick+brown+fox+jumped+over+the+lazy+dog&analyzer=end_grams'

# {
#    "tokens" : [
#       {
#          "end_offset" : 9,
#          "position" : 1,
#          "start_offset" : 7,
#          "type" : "word",
#          "token" : "ck"
#       },
#       {
#          "end_offset" : 15,
#          "position" : 2,
#          "start_offset" : 13,
#          "type" : "word",
#          "token" : "wn"
#       },
#       {
#          "end_offset" : 19,
#          "position" : 3,
#          "start_offset" : 17,
#          "type" : "word",
#          "token" : "ox"
#       },
#       {
#          "end_offset" : 26,
#          "position" : 4,
#          "start_offset" : 24,
#          "type" : "word",
#          "token" : "ed"
#       },
#       {
#          "end_offset" : 31,
#          "position" : 5,
#          "start_offset" : 29,
#          "type" : "word",
#          "token" : "er"
#       },
#       {
#          "end_offset" : 40,
#          "position" : 6,
#          "start_offset" : 38,
#          "type" : "word",
#          "token" : "zy"
#       },
#       {
#          "end_offset" : 44,
#          "position" : 7,
#          "start_offset" : 42,
#          "type" : "word",
#          "token" : "og"
#       }
#    ]
# }

clint


Alright, that is pretty much what we've done so far, but I'm looking
at getting "bro", "f", "jump"... into the index instead of the
endings, and possibly the original words as well.


I guess I was really unclear in my original text. I want to know how
to strip the last couple of characters in a word, and also keep the
original.


On Tue, 2013-02-26 at 12:09 +0100, Per Ekman wrote:

> Alright, that is pretty much what we've done so far, but I'm looking
> at getting "bro", "f", "jump"... into the index instead of the
> endings,

You specified that you wanted ngrams of the last two characters, which
is why I set "side" to "back".

> And possibly the original words as well.

Just make the edge ngrams long enough.
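
For example, something like this (an untested sketch - the index name
"test2" and the "front_grams" names are just illustrative): with the
side flipped to "front" and a wide length range, each word emits every
prefix, so "brown" yields "b", "br", "bro" ... right up to the full
"brown", which covers both the truncated forms and the original words:

# untested sketch - index and filter/analyzer names are illustrative
curl -XPUT 'http://127.0.0.1:9200/test2/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "front_grams" : {
               "type" : "edge_ngram",
               "side" : "front",
               "min_gram" : 1,
               "max_gram" : 20
            }
         },
         "analyzer" : {
            "front_grams" : {
               "tokenizer" : "standard",
               "filter" : [
                  "standard",
                  "lowercase",
                  "front_grams"
               ]
            }
         }
      }
   }
}
'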

You may want to use a multi-field to have one field indexed with (eg)
the standard analyzer and another indexed with edge-ngrams; you can
then query both in a single query, giving a different boost to each
clause.
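
A sketch of what that might look like (untested; the "doc" type and
the "title" field are placeholders, and "front_grams" is the analyzer
sketched above):

# hypothetical mapping - the same text indexed two ways
curl -XPUT 'http://127.0.0.1:9200/test2/doc/_mapping?pretty=1' -d '
{
   "doc" : {
      "properties" : {
         "title" : {
            "type" : "multi_field",
            "fields" : {
               "title" : { "type" : "string", "analyzer" : "standard" },
               "grams" : { "type" : "string", "analyzer" : "front_grams" }
            }
         }
      }
   }
}
'

# query both sub-fields, boosting the standard-analyzed clause
curl -XGET 'http://127.0.0.1:9200/test2/_search?pretty=1' -d '
{
   "query" : {
      "bool" : {
         "should" : [
            { "match" : { "title"       : { "query" : "jumped", "boost" : 2 } } },
            { "match" : { "title.grams" : { "query" : "jumped" } } }
         ]
      }
   }
}
'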

clint


On Tue, 2013-02-26 at 03:13 -0800, Per Ekman wrote:

> I guess I was really unclear in my original text. I want to know how
> to strip the last couple of characters in a word, and also keep the
> original.

Ah right

Currently you can't do that in the same field - you can have one field
with the full word, and another field which uses the pattern tokenizer
to drop the last two letters.
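
For instance (a rough, untested sketch - index, tokenizer and analyzer
names are illustrative; since the regex needs at least three letters,
one- and two-letter words produce no token at all):

# pattern tokenizer: capture everything except the last two letters
curl -XPUT 'http://127.0.0.1:9200/test3/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "tokenizer" : {
            "strip_last_two" : {
               "type" : "pattern",
               "pattern" : "([a-zA-Z]+)[a-zA-Z]{2}",
               "group" : 1
            }
         },
         "analyzer" : {
            "stripped" : {
               "tokenizer" : "strip_last_two",
               "filter" : [
                  "lowercase"
               ]
            }
         }
      }
   }
}
'

# should emit "bro", "f" and "jump"
curl -XGET 'http://127.0.0.1:9200/test3/_analyze?pretty=1&text=brown+fox+jumped&analyzer=stripped'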

I'm hoping to get a token filter accepted which does allow multiple
captures per position in the same field:
https://issues.apache.org/jira/browse/LUCENE-4766
but it'll be a while before that happens.

clint


Cool. Yeah, we were playing around with the pattern tokenizer to
achieve this.
