edgeNGram minimum length omits shorter words


(fmpwizard) #1

Hi,

I have a text field like this:

8X DVD Drive
(and about 300k entries to index), so I started using a filter like this:

"edgeNgram_descr" : {
"type" : "edgeNGram",
"min_gram" : 3,
"max_gram" : 255,
"side" : "front"
}

so I could search for dri and it would find drive.

The problem is that because min_gram is 3, the 8x is not indexed, so if I
have two entries:

8X DVD Drive
2X DVD Drive

and I search for 8x DVD, both documents are returned.

In plain English, I would like to tell ElasticSearch to apply edgeNGram to
words that are 3 characters or longer, but if it finds a word that is only
one or two characters long, to index the full word, so I can search for 8x
and just get the one doc with 8x.

Is this possible?

Thanks

Diego

--


(simonw-2) #2

Hey, I am afraid this is unfortunately not possible. What I'd do in your
case is use two fields: index one without ngrams and one with ngrams, and
search across both. I'd also use ngrams, not edge ngrams, if you do
full-text search, and set min_gram = max_gram = 3, or maybe even 5. Those
massive edge n-grams and large max values will cause a lot of trouble
scoring-wise and create massive postings lists under the hood. I personally
always set them to the same value, though.
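For reference, the two-field setup could look roughly like this under the multi_field mapping of that era; the names trigram_filter, trigram_analyzer, and product are illustrative, not from the thread:

```json
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "trigram_filter" : { "type" : "nGram", "min_gram" : 3, "max_gram" : 3 }
      },
      "analyzer" : {
        "trigram_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "lowercase", "trigram_filter" ]
        }
      }
    }
  },
  "mappings" : {
    "product" : {
      "properties" : {
        "description" : {
          "type" : "multi_field",
          "fields" : {
            "description" : { "type" : "string", "analyzer" : "standard" },
            "ngram" : { "type" : "string", "analyzer" : "trigram_analyzer" }
          }
        }
      }
    }
  }
}
```

Queries would then target both description and description.ngram.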

does this make sense?

simon

On Sunday, October 14, 2012 7:24:54 AM UTC+2, fmpwizard wrote:



(fmpwizard) #3

Hi Simon,

Thanks for your answer. I'll try the idea of having two fields, and if I
run into any issues I'll post again. About using nGram vs edgeNGram: my
complete analyzer/tokenizer/filter setup has a few more options than just
the edgeNGram.

But I think that in my case, edgeNGram is doing the right thing (though I'm
still new to ElasticSearch, so I may be off). Using the linked settings, if
I analyze a phrase like:

This is a keyboard for toshiba

The analyzed text from

curl "http://localhost:9200/test/_analyze?text=This+is+a+keyboard+for+toshiba&analyzer=description_analyzer&pretty=true"

gives me:

{
"tokens" : [ {
"token" : "thi",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
}, {
"token" : "key",
"start_offset" : 10,
"end_offset" : 13,
"type" : "word",
"position" : 3
}, {
"token" : "keyb",
"start_offset" : 10,
"end_offset" : 14,
"type" : "word",
"position" : 4
}, {
"token" : "keybo",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 5
}, {
"token" : "keyboa",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 6
}, {
"token" : "keyboar",
"start_offset" : 10,
"end_offset" : 17,
"type" : "word",
"position" : 7
}, {
"token" : "keyboard",
"start_offset" : 10,
"end_offset" : 18,
"type" : "word",
"position" : 8
}, {
"token" : "for",
"start_offset" : 19,
"end_offset" : 22,
"type" : "word",
"position" : 9
}, {
"token" : "tos",
"start_offset" : 23,
"end_offset" : 26,
"type" : "word",
"position" : 10
}, {
"token" : "tosh",
"start_offset" : 23,
"end_offset" : 27,
"type" : "word",
"position" : 11
}, {
"token" : "toshi",
"start_offset" : 23,
"end_offset" : 28,
"type" : "word",
"position" : 12
}, {
"token" : "toshib",
"start_offset" : 23,
"end_offset" : 29,
"type" : "word",
"position" : 13
}, {
"token" : "toshiba",
"start_offset" : 23,
"end_offset" : 30,
"type" : "word",
"position" : 14
  } ]
}

And then on the search side, I can do

{"fields":["id","part_number","description","qty_available","sale_price_arg","brand","category","subcategory"],"from":0,"size":500,"query":{"bool":{"must":[{"match":{"description":{"query":"key
toshiba","operator":"and"}}}]}}}

and it will find this document, because key is a token, same as toshiba.
But if I do ngram, I would be indexing things like eyb (from keyboard),
and my users will not be searching for such a string.

Now that I posted more information, do you think there is any other way to
solve my use case?

Thanks

Diego

On Sunday, October 14, 2012 8:51:55 AM UTC-4, simonw wrote:



(simonw-2) #4

Hey,

On Monday, October 15, 2012 6:03:09 AM UTC+2, fmpwizard wrote:

Hi Simon,

Thanks for your answer. I'll try the idea of having two fields, and if I
run into any issues I'll post again. About using ngram vs edgeNgram, my
complete analyzer/tokenizer/filter has a few more options than just the
edgeNgram.

no worries you are welcome!

https://gist.github.com/3890730

ok cool - looks reasonable!


and it will find this document, because key is a token, same as toshiba.

OK, cool. So what if I type "thosiba", which seems like a common misspelling?
I am just saying that if you already pay the price for ngrams, I'd not do it
only for edges.

But if I do ngram, I would be indexing things like eyb (from keyboard),
and my users will not be searching for such a string

Well, ngrams are not about indexing what people search for; they are about
recall optimization, and about not being entirely off if there are small
spelling errors. Remember that your query is analyzed with the same analyzer
as your data, so you are building all the edge ngrams for your query too. In
your case "key toshiba" -> ["key", "tos", "tosh", "toshi", ...], and ALL of
them must match on the same document since you specify the operator "and".
I'd rather do an OR query with ngrams and "minimum_should_match" set to some
reasonable number than the conjunction query you are doing here.
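Sketching that suggestion (the field name is copied from the earlier query; the percentage is just an illustrative starting point, and the match query's minimum_should_match accepts either a count or a percentage):

```json
{
  "query" : {
    "match" : {
      "description" : {
        "query" : "key toshiba",
        "operator" : "or",
        "minimum_should_match" : "70%"
      }
    }
  }
}
```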

I am happy to help more with this if you think it's worth exploring.

simon



(fmpwizard) #5

Hi,

and it will find this document, because key is a token, same as toshiba.

OK, cool. So what if I type "thosiba", which seems like a common
misspelling? I am just saying that if you already pay the price for ngrams,
I'd not do it only for edges.

Ah, I didn't think of this use case, yes, it would be helpful for my users.

But if I do ngram, I would be indexing things like eyb (from keyboard),
and my users will not be searching for such a string.

Well, ngrams are not about indexing what people search for; they are about
recall optimization, and about not being entirely off if there are small
spelling errors.

thanks for this clarification.

Remember that your query is analyzed with the same analyzer as your data,
so you are building all the edge ngrams for your query too. In your case
"key toshiba" -> ["key", "tos", "tosh", "toshi", ...], and ALL of them must
match on the same document since you specify the operator "and". I'd rather
do an OR query with ngrams and "minimum_should_match" set to some
reasonable number than the conjunction query you are doing here.

I am happy to help more with this if you think it's worth exploring.

Yes please, I really appreciate all your help.
I think at this point I should give the full description of what my
application is about, and how search fits into it.

I'm working on replacing an existing application, an inventory database; it
keeps track of things like stock and price. It is for computer parts, so
there is information like "part number A fits in laptops B, C, and D".

My search form has about 24 fields, some of the fields are: part number,
description, brand, qty on hand, cost, sales price, compatible models (in
which laptop, desktop, server does this one part fit)

For the numeric fields (cost, qty), I have a regular search that matches
the number exactly, and I also added a small parser so they can enter
0..10 and it will do a range search, or they can enter <10 and it will do
the right thing.

Now, a normal workflow for my client would be:

Enter something like key in the description field, then Thinkpad T40 in
the compatible models field, and do a search.
Elasticsearch will have about 300k documents, and I need to show only those
whose description has key or keyboard (if there is something like a
keymain, it is OK to show it), but on the compatible field I need to match
only text that is Thinkpad T40, and not Thinkpad T4000.

The compatible model fields text looks like:

Thinkpad T40, Thinkpad 600

another document may have Thinkpad T400, Thinkpad E1505, Thinkslim 256

Another possible search on description is 9 cell, and it should find all
the documents that have something like
Battery 9 cell
Battery (9 Cells)

but not

Battery 6 cells

I'm worried that if I use ngram and the OR operator, it will find more
results than the user expects. Do you think I should use ngram to analyze
the data, but something else as the search analyzer?

Thank you, and I'll be happy to provide more details if you need them.

Diego



(simonw-2) #6

hey diego,

it seems like the ngram approach would work fine for you. Still, ngrams, as
I said, optimize for recall, so you might want to win back some precision,
since you might get a lot of documents that are not really relevant. I
assume you are showing all results, right? My approach would be to use two
fields for your description: one holds ngrams and the other holds shingles
(term n-grams). Given the document description "8X DVD Drive" you would get
"8x", "8xdvd", "dvd", "dvddrive", "drive" (use the shingle filter with ""
as the separator and max_shingle_size = min_shingle_size = 2).
Then combine the two fields in a boolean query and disable coord on the
top-level boolean query. For the ngram field I'd add minimum_should_match
based on the percentage of generated terms that must match, like
"minimum_should_match" : "1<100% 2<66% 3<75% 4<80% 5<83% 6<85% 7<87% 8<88%
9<90%"; you might need to play around with the percentages, though. This
should put the really good matches right at the top, and anything that is
slightly off should score lower.
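The shingle side of that setup might be configured roughly like this, assuming the shingle token filter's token_separator and output_unigrams options; the filter name is made up for illustration:

```json
"filter" : {
  "descr_shingles" : {
    "type" : "shingle",
    "min_shingle_size" : 2,
    "max_shingle_size" : 2,
    "token_separator" : "",
    "output_unigrams" : true
  }
}
```

With output_unigrams enabled, "8X DVD Drive" yields the single terms plus the joined pairs, i.e. "8x", "8xdvd", "dvd", "dvddrive", "drive" after lowercasing.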

Something I do sometimes, too, is to prefix/suffix the ends of a token to
get more precision and make those boundary terms mandatory, i.e. "drive" ->
["dr", "ri", "iv", "ve"] with the first and last grams carrying a boundary
marker. But this would involve coding, since there are no filters that do
that out of the box, nor is there query support... yet :slight_smile:

if you have question, lemme know!

simon

On Sunday, October 14, 2012 7:24:54 AM UTC+2, fmpwizard wrote:



(fmpwizard) #7

Thanks, I'll try this out tonight and let you know how it goes.

Diego

On Monday, October 15, 2012 3:16:22 PM UTC-4, simonw wrote:



(fmpwizard) #8

Hi Simon,

I tried what you suggested, but I'm not getting the results I was
expecting; some searches work as expected, but not others. Maybe I missed
something from your previous emails. I made this gist
https://gist.github.com/3909810
that you can run to see exactly what is going on.

So, for the query string tosh, it is not finding the documents with
toshiba in them, but at least searching for 6 cell works as expected now.
Searching for LiION does not return results, but searching for Li-ION does.

Thank you

Diego

On Wednesday, October 17, 2012 9:37:58 PM UTC-4, fmpwizard wrote:



(simonw-2) #9

hey man,

  1. One thing I would change: the word delimiter filter should not
    concatenate in the shingle case.
  2. Don't preserve the original in the shingle case.
  3. Use a multi_field where one field is shingle and the other is ngram.
  4. Use default operator OR and set minimum_should_match to something like
    70% to start with.
  5. Always search against both fields.

Can you try this first? It might make sense to increase the ngram size to
4; it will likely give you better results. But you should play around with
it a bit.
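Putting points 3-5 together, the search side might look something like this; the sub-field names and the 70% value are placeholders, not settings from the gist:

```json
{
  "query" : {
    "bool" : {
      "should" : [
        { "match" : { "description.shingle" : { "query" : "6 cell battery" } } },
        { "match" : { "description.ngram" : {
            "query" : "6 cell battery",
            "minimum_should_match" : "70%" } } }
      ]
    }
  }
}
```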

simon

On Thursday, October 18, 2012 6:12:10 AM UTC+2, fmpwizard wrote:



(BillyEm) #10

Jesus. You want your n-gramming to be so abstract that it doesn't have a
configurable language behind it? Go fix the query parser.

On Thursday, October 18, 2012 3:44:14 AM UTC-4, simonw wrote:



(BillyEm) #11

Sorry, I should be more clear. Did you ever read the original n-gram text
search work by LANL? Or was it Sandia? Probably together. High-energy
physicists refer to this as the "normal" behavior (not the problem here,
but the problem of finding the right sampling method). Unfortunately, when
you deal with human participants you get other distributions. A lot of
them.

b

On Thursday, October 18, 2012 9:19:29 PM UTC-4, BillyEm wrote:

jesus. you want your ngraming to be so abstract that it doesn't have a
configurable language behind it? Go fix the query parser.

On Thursday, October 18, 2012 3:44:14 AM UTC-4, simonw wrote:

hey man,

  1. I would change is the word delimiter filter should not concatenate in
    the shingle case.
  2. Don't preserve the original in the shingle case
  3. use a multi_field where one field is shingle and the other is ngram
  4. use default operator OR and set minimum should match to something like
    70% to start with
  5. always search against both fields

can you try this first? It might make sense to increase the ngram size to
4 it will likely give you better results. But you should play around with
it a bit

simon

On Thursday, October 18, 2012 6:12:10 AM UTC+2, fmpwizard wrote:

Hi Simon,

I tried what you suggested, but I'm not getting the results I was
expecting, some searches work as expected, but not others. Maybe I missed
something from your previous emails.
I made this gist
https://gist.github.com/3909810

that you can run and see exactly what is going on.

So, for the query string tosh, it is not finding the documents with
toshiba in them, but at least searching for 6 cell works as expected now.
Searching for LiION does not return results, but searching for Li-ION
does.

Thank you

Diego

On Wednesday, October 17, 2012 9:37:58 PM UTC-4, fmpwizard wrote:

Thanks, I'll try this out tonight and let you know how it goes.

Diego

On Monday, October 15, 2012 3:16:22 PM UTC-4, simonw wrote:

hey diego,

it seems like the ngram approach would work fine for you. Yet, ngrams,
as I said, optimize for recall, so you might want to get precision back, since
you might get a lot of documents that are not really relevant. I assume you
are showing all results, right? My approach would be to use 2 fields for your
description: one holds ngrams and the other holds shingles (term ngrams).
Given the document description "8X DVD Drive" you would get "8xdvd",
"8x", "dvddrive", "dvd", "drive" (use the shingle filter with "" as the
separator and max_shingle_size = min_shingle_size = 2).
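To illustrate that shingle output, here is a plain-Python approximation (the real shingle filter interleaves shingles with unigrams by position, so only the set of terms matches, not their order):

```python
def shingles(tokens, size=2, separator="", output_unigrams=True):
    """Approximate Lucene's shingle filter: join each run of
    `size` adjacent tokens with `separator`."""
    out = list(tokens) if output_unigrams else []
    for i in range(len(tokens) - size + 1):
        out.append(separator.join(tokens[i:i + size]))
    return out

print(shingles(["8x", "dvd", "drive"]))
# ['8x', 'dvd', 'drive', '8xdvd', 'dvddrive']
```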
Then you combine the two fields in a boolean query and disable coords
on the top-level boolean query. For the ngram field I'd add
minimum_should_match based on the percentage of terms that are generated, like
"minimum_should_match" = "1<100% 2<66% 3<75% 4<80% 5<83% 6<85% 7<87%
8<88% 9<90%". You might need to play around with the percentages
though. This should give you the really good matches right at the top, and
something that is slightly off should score lower.
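Put together, the query side could look something like this (a sketch only: the sub-field names assume a multi_field called description, the exact query name varies by version — older releases spell match as text — and the minimum_should_match string is the combination syntax described above):

```json
{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "match": { "description.shingle": "8x dvd drive" } },
        {
          "match": {
            "description.ngram": {
              "query": "8x dvd drive",
              "minimum_should_match": "1<100% 2<66% 3<75% 4<80% 5<83% 6<85% 7<87% 8<88% 9<90%"
            }
          }
        }
      ]
    }
  }
}
```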

something I do sometimes too is to prefix / suffix the ends of a token
to get more precision and make those boundary terms mandatory, i.e.
"drive" -> ["dr", "ri", "iv", "ve"], but this would involve coding, since
there are no filters that do that out of the box, nor is there query
support... yet :)
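One way to read that prefix/suffix trick is to surround each token with a boundary marker before bigramming, so that grams touching a word edge become distinct terms (a sketch; the marker character is my assumption, and making the boundary grams mandatory would still need the query-side support Simon mentions):

```python
def boundary_bigrams(token, mark="_"):
    """Surround the token with a boundary marker, then emit all
    character bigrams, so edge grams carry the marker."""
    t = mark + token + mark
    return [t[i:i + 2] for i in range(len(t) - 1)]

print(boundary_bigrams("drive"))
# ['_d', 'dr', 'ri', 'iv', 've', 'e_']
```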

if you have questions, lemme know!

simon



(fmpwizard) #12

@Simon,

I updated the gist

Some comments:

I wasn't using the word delimiter filter for the shingle case (or am I?). I
thought that by using:

      "description_shingle_analyzer" : {
        "type" : "custom",
        "tokenizer" : "description_tokenizer",
        "filter" : [ "desc_shingle", "lowercase"]
      }

I am only using the filters I list in the "filter" entry.

Could you see if my gist is doing everything you asked me to do? I think I
followed all your 5 points, but it would be great if you could check.

The result: we are almost there! I can search and get correct results for
terms like cell / cells / liion / li-ion, but it fails when I search for
6 cell: it finds the document with 8 Cells :(
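The 6 cell vs. 8 Cells behavior is consistent with min_gram = 3 silently dropping one- and two-character tokens (a plain-Python sketch of the n-gram filter, assuming no preserve_original), so the query reduces to grams of "cell" alone, which match "8 Cells" just as well:

```python
def ngrams(token, min_gram=3, max_gram=3):
    """All character n-grams of the token, lengths min_gram..max_gram."""
    return [token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)]

print(ngrams("6"))      # [] -- the digit produces no terms at all
print(ngrams("cells"))  # ['cel', 'ell', 'lls']
```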

Thanks

Diego

On Thursday, October 18, 2012 9:23:56 PM UTC-4, BillyEm wrote:


Is this meant for me?




(system) #13