Hi Simon,
Thanks for your answer. I'll try the idea of having two fields, and if I run into any issues I'll post again. Something like the sketch below is what I have in mind.
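A minimal multi_field mapping sketch (just a rough idea, not tested: the "ngram" sub-field name is made up, and description_analyzer is the edgeNgram analyzer from the settings I linked):

{
  "mappings" : {
    "product" : {
      "properties" : {
        "description" : {
          "type" : "multi_field",
          "fields" : {
            "description" : { "type" : "string", "analyzer" : "standard" },
            "ngram" : { "type" : "string", "analyzer" : "description_analyzer" }
          }
        }
      }
    }
  }
}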
About using ngram vs edgeNgram: my complete analyzer/tokenizer/filter has a few more options than just the edgeNgram, but I think edgeNgram is doing the right thing for my use case (I'm still new to Elasticsearch, so I may be off). Using the linked settings, if I analyze a phrase like:
This is a keyboard for toshiba
then running:
curl
"http://localhost:9200/test/_analyze?text=This+is+a+keyboard+for+toshiba&analyzer=description_analyzer&pretty=true"
gives me:
{
"tokens" : [ {
"token" : "thi",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
}, {
"token" : "key",
"start_offset" : 10,
"end_offset" : 13,
"type" : "word",
"position" : 3
}, {
"token" : "keyb",
"start_offset" : 10,
"end_offset" : 14,
"type" : "word",
"position" : 4
}, {
"token" : "keybo",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 5
}, {
"token" : "keyboa",
"start_offset" : 10,
"end_offset" : 16,
"type" : "word",
"position" : 6
}, {
"token" : "keyboar",
"start_offset" : 10,
"end_offset" : 17,
"type" : "word",
"position" : 7
}, {
"token" : "keyboard",
"start_offset" : 10,
"end_offset" : 18,
"type" : "word",
"position" : 8
}, {
"token" : "for",
"start_offset" : 19,
"end_offset" : 22,
"type" : "word",
"position" : 9
}, {
"token" : "tos",
"start_offset" : 23,
"end_offset" : 26,
"type" : "word",
"position" : 10
}, {
"token" : "tosh",
"start_offset" : 23,
"end_offset" : 27,
"type" : "word",
"position" : 11
}, {
"token" : "toshi",
"start_offset" : 23,
"end_offset" : 28,
"type" : "word",
"position" : 12
}, {
"token" : "toshib",
"start_offset" : 23,
"end_offset" : 29,
"type" : "word",
"position" : 13
}, {
"token" : "toshiba",
"start_offset" : 23,
"end_offset" : 30,
"type" : "word",
"position" : 14
} ]
}
And then on the search side, I can run a query like:
{
  "fields" : ["id", "part_number", "description", "qty_available", "sale_price_arg", "brand", "category", "subcategory"],
  "from" : 0,
  "size" : 500,
  "query" : {
    "bool" : {
      "must" : [
        { "match" : { "description" : { "query" : "key toshiba", "operator" : "and" } } }
      ]
    }
  }
}
and it will find this document, because "key" is an indexed token, just like "toshiba". But if I used ngram instead, I would also be indexing tokens like "eyb" (from keyboard), and my users will never search for a string like that.
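For example, with a plain nGram filter set to min_gram = max_gram = 3 as Simon suggests (a sketch, not my actual settings; the filter name is made up):

"ngram_descr" : {
  "type" : "nGram",
  "min_gram" : 3,
  "max_gram" : 3
}

the word keyboard would be indexed as: key, eyb, ybo, boa, oar, ard.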
Now that I've posted more details about my use case, do you think there is any other way to solve it?
Thanks
Diego
On Sunday, October 14, 2012 8:51:55 AM UTC-4, simonw wrote:
Hey, I am afraid this is unfortunately not possible. What I'd do in your case is use two fields: index one without ngrams and one with ngrams, and search across both. I'd also use ngrams rather than edgengrams if you do fulltext search, and set min_gram = max_gram = 3, or maybe even 5. Huge edge n-grams or large max_gram values cause a lot of trouble scoring-wise and create massive posting lists under the hood, which is why I personally always set min_gram and max_gram to the same value.
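For example, something like this (a rough sketch; the field names assume a multi_field mapping with a plain description field and a description.ngram sub-field):

{
  "query" : {
    "multi_match" : {
      "query" : "8x dvd drive",
      "fields" : ["description", "description.ngram"]
    }
  }
}

The plain field matches short tokens like 8x exactly, while the ngram field covers partial matches like dri.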
does this make sense?
simon
On Sunday, October 14, 2012 7:24:54 AM UTC+2, fmpwizard wrote:
Hi,
I have a text field like this:
8X DVD Drive
(and about 300k entries to index), so I started using a filter like this:
"edgeNgram_descr" : {
"type" : "edgeNGram",
"min_gram" : 3,
"max_gram" : 255,
"side" : "front"
}
so that I can search for dri and it will find drive.
The problem is that, because min_gram is 3, 8x is not indexed at all. So if I have two entries:
8X DVD Drive
2X DVD Drive
and I search for 8x DVD, both documents are returned.
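Running the entries through the analyzer shows why (a sketch; I'm assuming the filter above is wired into an analyzer, here called description_analyzer just for illustration):

curl
"http://localhost:9200/test/_analyze?text=8X+DVD+Drive&analyzer=description_analyzer&pretty=true"

8X is shorter than min_gram, so it produces no tokens at all; DVD produces dvd, and Drive produces dri, driv, and drive. Since the query text goes through the same analyzer, searching for 8x DVD effectively becomes a search for just dvd, which matches both entries.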
In plain English, I would like to tell Elasticsearch to apply edgeNgram to words that are 3 characters or longer, but if it finds a one- or two-character word, to index the full word, so that I can search for 8x and get just the one doc with 8x.
Is this possible?
Thanks
Diego