My cursory first attempt to investigate this type of matching seems to
indicate that the out-of-the-box behavior might be close to what I need
after all. Assuming I leave the search term and the locale properties alone,
I am seeing the matches and scoring I need. For example, if the locales are
['en', 'en_US', 'en_GB', 'ar', 'ar_SA'], plus a second 'en', the following
searches return these results:
term     results in order
en       en, en, en_US, en_GB
en_GB    en_GB
ar       ar, ar_SA
ar_SA    ar_SA
So, I guess I got lucky and the out-of-the-box experience is nearly what I
need. I might be getting fooled by my simple test, however. I thought that
when I ran an early test for 'en', the exact match ('en') was not scored
higher than 'en_US', but I can't seem to reproduce that issue.
I would probably want a search for 'en_GB' to fall back to 'en' when there
is no exact 'en_GB' match. Adding the language to the search terms whenever
the user specifies a lang_country term would take care of that. For example,
a search for 'en_GB' would become a search for 'en en_GB'.
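A minimal sketch of that search-term expansion in Python, assuming locales
are always written as language or language_COUNTRY. The function name
expand_locale_query is hypothetical and is not part of any search library:

```python
def expand_locale_query(term):
    """Return the terms to search for: the bare language, then the full locale."""
    term = term.strip()
    language, sep, country = term.partition("_")
    if sep and country:
        # A lang_country term also searches for the language on its own.
        return [language, term]
    return [language]


print(" ".join(expand_locale_query("en_GB")))  # en en_GB
print(" ".join(expand_locale_query("ar")))     # ar
```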
The only problem at the moment is the query can't differentiate between
matching on the language or the country. For example, if the language I am
searching for is Sanskrit (sa), it will match on my country name for Saudi
Arabia (also SA but capitalized). If the searches were case sensitive, I
suppose this could be managed.
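One way to picture the fix, independent of any particular analyzer: keep the
language and the country as distinct kinds of token so that 'sa' and 'SA' can
never collide. The sketch below is plain Python, not search-engine
configuration; the function name locale_tokens and the "lang:" and "country:"
prefixes are made up for illustration. In an index this might correspond to
storing language and country in two separate fields, but that mapping is an
assumption.

```python
def locale_tokens(locale):
    """Split a locale into role-tagged tokens so language and country never collide."""
    language, _, country = locale.partition("_")
    tokens = {"lang:" + language.lower()}
    if country:
        tokens.add("country:" + country.upper())
    return tokens


# Sanskrit ('sa') shares no token with Arabic as used in Saudi Arabia ('ar_SA').
print(locale_tokens("sa") & locale_tokens("ar_SA"))  # set()
# Plain Arabic still overlaps with 'ar_SA' on the language token.
print(locale_tokens("ar") & locale_tokens("ar_SA"))  # {'lang:ar'}
```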
My gut feel is that doing this right might require a custom analyzer, but I
am willing to explore other approaches if that is unnecessarily involved.
On Sun, Sep 12, 2010 at 1:53 PM, James Cook jcook@tracermedia.com wrote:
I have a multilingual datastore of documents. After filtering by other
criteria, I need to find the document that best matches a particular locale
specified by the user. As an example, assume I have narrowed my choices down
to four documents, each with one of the following locale properties:
- doc1.locale = 'en_US';
- doc2.locale = 'ar';
- doc3.locale = 'en_GB';
- doc4.locale = 'ar_SA';
I'd like to devise a query which will return the results ranked where a
specific language/country beats out just a match on language.
- A query for 'en' will score 'en_US' and 'en_GB' the same and higher
than the rest.
- A query for 'en_GB' will score 'en_GB' the highest, 'en_US' next.
- A query for 'en_GU' will score 'en_US' and 'en_GB' the same and
higher than the rest.
- A query for 'ar' will score 'ar' the highest, followed by 'ar_SA'.
- A query for 'ar_SA' will score 'ar_SA' the highest, followed by
'ar'.
- A query for 'ar_LB' will score 'ar' the highest, followed by
'ar_SA'.
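The ranking rules above can be written down as a small reference scorer,
which is handy for checking whatever query is eventually used against the
expected order. This is only a sketch in plain Python: the names locale_score
and rank and the score values 3/2/1/0 are arbitrary choices, and only their
relative order matters.

```python
def locale_score(query, doc_locale):
    """Score a document locale against the requested locale (higher is better)."""
    if query == doc_locale:
        return 3  # exact match on the whole locale
    query_lang = query.partition("_")[0]
    doc_lang, _, doc_country = doc_locale.partition("_")
    if query_lang != doc_lang:
        return 0  # different language
    # Same language: a document with no country beats one for another country.
    return 2 if not doc_country else 1


def rank(query, locales):
    """Order locales best-first; ties keep their original order."""
    return sorted(locales, key=lambda loc: locale_score(query, loc), reverse=True)


locales = ["en_US", "ar", "en_GB", "ar_SA"]
print(rank("en_GB", locales))  # ['en_GB', 'en_US', 'ar', 'ar_SA']
print(rank("ar_LB", locales))  # ['ar', 'ar_SA', 'en_US', 'en_GB']
```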
I suppose I can achieve this sort order by breaking the locale stored on
the document into two terms when the language and country are included. For
example:
- doc1.locale = 'en_US en';
- doc2.locale = 'ar';
- doc3.locale = 'en_GB en';
- doc4.locale = 'ar_SA ar';
Then perhaps I can do something similar with the search term, so the term
'en_GB' becomes 'en en_GB' with a boost on the more precise match.
But I was wondering if there was a better way to perform this query than me
manually mangling the search term and the locales stored in each of my
objects.
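For reference, the manual approach described above can be sketched in plain
Python as follows. The helper names index_terms and boosted_score and the
boost values 1.0 and 2.0 are assumptions for illustration; a real search
engine computes scores differently, so this only shows the intended relative
order.

```python
def index_terms(locale):
    """Terms to store for a document: the full locale plus its bare language."""
    language, sep, country = locale.partition("_")
    return {locale, language} if sep and country else {locale}


def boosted_score(query_boosts, doc_terms):
    """Sum the boosts of the query terms that the document contains."""
    return sum(boost for term, boost in query_boosts.items() if term in doc_terms)


docs = {"doc1": "en_US", "doc2": "ar", "doc3": "en_GB", "doc4": "ar_SA"}
indexed = {name: index_terms(locale) for name, locale in docs.items()}

# The search term 'en_GB' expanded to 'en en_GB', boosting the precise match.
query = {"en": 1.0, "en_GB": 2.0}
scores = {name: boosted_score(query, terms) for name, terms in indexed.items()}
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, docs[name], scores[name])
```

With these boosts doc3 ('en_GB') comes first and doc1 ('en_US') second. Note
that a query for 'ar_LB' would score 'ar' and 'ar_SA' equally under this
scheme, so the preference for the plain 'ar' document would have to come from
somewhere else.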
Thanks for any help.