Advice about mapping


(Zachary Tong) #1

Hi everyone. I'm relatively new to ElasticSearch (and Lucene in general),
but picking it up fairly quickly. It's a lovely piece of software!

I was curious if I could get some advice about mapping "best practices" for
general search. I am searching an index of product names for matching or
similar products. The mapping that I have on the index is fairly basic:

https://gist.github.com/bce8e9ce47a96a24e693

Effectively, I'm doing basic tokenization and then either phrase matching
or matching term nGrams. I am boosting full phrase matches because that
seems to give the best relevance when people know exactly what they are
searching for, while the low-weighted nGrams help for non-match phrases.
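
As a rough illustration of that query shape (not the actual query behind the gist), here is a minimal Python sketch that builds the request body as a plain dict. The field names `name` and `name.ngram`, the boost values, and the helper `build_product_query` are all hypothetical, and it assumes an ElasticSearch version that accepts `match_phrase` and `match` clauses inside a `bool` query.

```python
import json


def build_product_query(user_query, phrase_boost=10.0, ngram_boost=0.5):
    """Build a bool query: boosted phrase match plus a low-weighted nGram match.

    Field names and boost values are assumptions for illustration only.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Exact phrase on the plainly tokenized field, weighted heavily.
                    {"match_phrase": {"name": {"query": user_query, "boost": phrase_boost}}},
                    # Partial matches on the nGram sub-field, weighted lightly.
                    {"match": {"name.ngram": {"query": user_query, "boost": ngram_boost}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_product_query("3C 2200mAh Nanotech"), indent=2))
```

Because both clauses sit under `should`, a document can match on nGrams alone, and a full phrase match simply adds a much larger score on top.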

Is there anything that I'm doing sub-optimally or something I should add?
Is it silly to have all three nGrams - front, back and middle? This
search is exclusively for products, so there are a lot of strange queries
("3C 2200mAh Nanotech"). Unfortunately, a lot of those small terms ("3C")
are very important so I have to make sure they aren't filtered out.
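
To make the front/back/middle question concrete, here is a small pure-Python sketch of the three kinds of nGrams. It only imitates the general idea; the function names and the `min_gram`/`max_gram` values are assumptions, not the settings from the gist, and real token filters may treat short tokens differently.

```python
def front_ngrams(token, min_gram, max_gram):
    """Prefixes of the token, from min_gram up to max_gram characters."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def back_ngrams(token, min_gram, max_gram):
    """Suffixes of the token, from min_gram up to max_gram characters."""
    longest = min(max_gram, len(token))
    return [token[-n:] for n in range(min_gram, longest + 1)]


def all_ngrams(token, min_gram, max_gram):
    """Every substring of length min_gram..max_gram ("middle" nGrams included)."""
    return [
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    ]


if __name__ == "__main__":
    print(front_ngrams("nanotech", 2, 4))  # ['na', 'nan', 'nano']
    print(back_ngrams("nanotech", 2, 4))   # ['ch', 'ech', 'tech']
    print(all_ngrams("3c", 2, 4))          # ['3c']
    print(all_ngrams("3c", 3, 4))          # []
```

Two things fall out of this sketch. With the same size limits, the front and back grams are a subset of the full nGram set, so separate edge fields only add something if they are weighted differently. And a two-character term like "3c" yields no grams at all once `min_gram` is 3, so it would have to match through the plainly tokenized field instead.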

Thanks! Any help would be greatly appreciated!
-Zach


(Runar Myklebust-2) #2

I think, assuming the comparerc.com site is based on this, that you should
lowercase at both index and search time. No hits for e.g. "wheel" :)

greetings

Runar Myklebust

Enonic AS

An Open Source Company www.enonic.com/download

On Wed, Jun 27, 2012 at 2:21 AM, Zachary Tong zacharyjtong@gmail.com wrote:

Hi everyone. I'm relatively new to ElasticSearch (and Lucene in general),
but picking it up fairly quickly. It's a lovely piece of software!

I was curious if I could get some advice about mapping "best practices"
for general search. I am searching an index of product names for matching
or similar products. The mapping that I have on the index is fairly basic:

https://gist.github.com/bce8e9ce47a96a24e693

Effectively, I'm doing basic tokenization and then either phrase matching
or matching term nGrams. I am boosting full phrase matches because that
seems to give the best relevance when people know exactly what they are
searching for, while the low-weighted nGrams help for non-match phrases.

Is there anything that I'm doing sub-optimally or something I should add?
Is it silly to have all three nGrams - front, back and middle? This
search is exclusively for products, so there are a lot of strange queries
("3C 2200mAh Nanotech"). Unfortunately, a lot of those small terms ("3C")
are very important so I have to make sure they aren't filtered out.

Thanks! Any help would be greatly appreciated!
-Zach


(Zachary Tong) #3

Hehe, that is indeed my site. How'd you find it?

In any case, my analyzers are lowercasing. If you look at the mapping,
I'm applying [ "standard", "lowercase", "asciifolding" ] to all the
indexed product names.
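
For anyone following along, here is a loose Python approximation of what that chain does to a product name. It is only a sketch: the regex split stands in for the standard tokenizer, the Unicode decomposition stands in for ASCII folding, and the function names are made up for this example.

```python
import re
import unicodedata


def fold_to_ascii(token):
    """Strip combining accents after decomposition (rough stand-in for asciifolding)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Split on non-word characters, lowercase, then fold accents."""
    tokens = re.findall(r"\w+", text)
    return [fold_to_ascii(token.lower()) for token in tokens]


if __name__ == "__main__":
    print(analyze("3C 2200mAh Nanotech"))  # ['3c', '2200mah', 'nanotech']
    print(analyze("Wheel") == analyze("wheel"))  # True
```

As long as the same chain runs at both index and search time, "Wheel" and "wheel" end up as the same term.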

I'm also getting search results for "wheel"... what browser are you
using? It may just be a problem with my JavaScript, unrelated to
ElasticSearch.

-Zach

On Friday, June 29, 2012 8:35:23 AM UTC-4, Runar Myklebust wrote:

I think, assuming the comparerc.com site is based on this, that you should
lowercase at both index and search time. No hits for e.g. "wheel" :)

