Alternative approaches to a query

I have a multilingual datastore of documents. After filtering by other
criteria, I need to find the document that best matches a particular locale
specified by the user. As an example, assume I have narrowed my choices down
to four documents, each with one of the following locale properties:

  • doc1.locale = 'en_US';
  • doc2.locale = 'ar';
  • doc3.locale = 'en_GB';
  • doc4.locale = 'ar_SA';

I'd like to devise a query that returns results ranked so that a match on
both language and country beats a match on language alone.

  1. A query for 'en' will score 'en_US' and 'en_GB' the same and higher
    than the rest.
  2. A query for 'en_GB' will score 'en_GB' the highest, 'en_US' next.
  3. A query for 'en_GU' will score 'en_US' and 'en_GB' the same and higher
    than the rest.
  4. A query for 'ar' will score 'ar' the highest, followed by 'ar_SA'.
  5. A query for 'ar_SA' will score 'ar_SA' the highest, followed by 'ar'.
  6. A query for 'ar_LB' will score 'ar' the highest, followed by 'ar_SA'.
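
The six expectations above can be pinned down as a small reference scorer, which might be handy for checking whatever query ends up being used. This is only a sketch in plain Python, not Elasticsearch code. The names `locale_score` and `rank` and the weights (3, 2, 1, 0) are my own assumptions, chosen only so that the ordering matches the list.

```python
def locale_score(query: str, doc_locale: str) -> int:
    """Rank a document locale against a requested locale (higher is better).

    The weights are arbitrary; only their relative order matters.
    """
    q_lang, _, q_country = query.partition("_")
    d_lang, _, d_country = doc_locale.partition("_")

    if q_lang != d_lang:
        return 0  # different language: no match at all
    if q_country == d_country:
        return 3  # exact match (same country, or both are bare languages)
    if not d_country:
        return 2  # document is the bare language, query named a country
    return 1      # same language, but the document names another country


def rank(query, locales):
    """Return the locales that match at all, best first (ties keep input order)."""
    scored = [(locale_score(query, loc), loc) for loc in locales]
    ordered = sorted(scored, key=lambda pair: -pair[0])
    return [loc for score, loc in ordered if score > 0]
```

With the four example documents, `rank('ar_LB', ['en_US', 'ar', 'en_GB', 'ar_SA'])` gives `['ar', 'ar_SA']`, and `rank('en_GB', ...)` gives `['en_GB', 'en_US']`.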

I suppose I can achieve this sort order by breaking the locale stored on the
document into two terms when the language and country are included. For
example:

  • doc1.locale = 'en_US en';
  • doc2.locale = 'ar';
  • doc3.locale = 'en_GB en';
  • doc4.locale = 'ar_SA ar';

Then perhaps I can do something similar with the search term, so the term
'en_GB' becomes 'en en_GB' with a boost on the more precise match.
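
To make the idea concrete, here is a rough sketch of both halves of that mangling in Python. The helper names (`index_value`, `query_string`) and the boost of 2 are made up for illustration. The output uses Lucene query-string syntax (`field:term^boost`); whether the analyzer keeps 'en_GB' as a single term is something I have not verified.

```python
def split_locale(locale: str):
    """Split 'en_GB' into ('en', 'GB'); a bare 'en' gives ('en', '')."""
    language, _, country = locale.partition("_")
    return language, country


def index_value(locale: str) -> str:
    """Value to store on the document: full locale plus the bare language."""
    language, country = split_locale(locale)
    return f"{locale} {language}" if country else locale


def query_string(locale: str, field: str = "locale", boost: int = 2) -> str:
    """Query text: boosted exact locale plus an unboosted language fallback."""
    language, country = split_locale(locale)
    if not country:
        return f"{field}:{locale}"
    return f"{field}:{locale}^{boost} {field}:{language}"
```

Here `index_value('en_GB')` produces 'en_GB en', and `query_string('en_GB')` produces 'locale:en_GB^2 locale:en'.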

But I was wondering if there was a better way to perform this query than me
manually mangling the search term and the locales stored in each of my
objects.

Thanks for any help.

My cursory first attempt to investigate this type of matching seems to
indicate that the out-of-the-box behavior might be close to what I need
after all. Assuming I leave the search term and the locale properties alone,
I am seeing the matches and scoring I need. For example, if the locales are
['en', 'en_US', 'en_GB', 'ar', 'ar_SA', 'en'] (with 'en' appearing twice),
the following searches return these results:

term     results in order
-------  ----------------------
en       en, en, en_US, en_GB
en_GB    en_GB
ar       ar, ar_SA
ar_SA    ar_SA

So, I guess I got lucky and the out-of-the-box experience is nearly what I
need. I might be getting fooled by my simple test, however. I thought that
when I ran an early test for 'en', I wasn't getting the exact match ('en')
to be scored higher than 'en_US', but I can't seem to duplicate that issue.

I would probably want 'en_GB' to match 'en' in the absence of an exact
'en_GB' match. So, adding the language to the search terms when the user
specifies a lang_country term would alleviate the problem. For example, a
search for 'en_GB' would become a search for 'en en_GB'.

The only problem at the moment is that the query can't differentiate between
a match on the language and a match on the country. For example, if the
language I am searching for is Sanskrit (sa), it will also match the country
code for Saudi Arabia (SA, the same letters but capitalized). If the searches
were case sensitive, I suppose this could be managed.
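
One way to sidestep the 'sa' versus 'SA' clash without relying on case sensitivity might be to keep the language and the country in separate fields, so that a language query never looks at country codes. The sketch below is an assumption on my part and has not been tried against the index; the field names `language` and `country` and both helper names are hypothetical.

```python
def locale_fields(locale: str) -> dict:
    """Split a locale such as 'ar_SA' into separate document fields."""
    language, _, country = locale.partition("_")
    fields = {"locale": locale, "language": language.lower()}
    if country:
        fields["country"] = country.upper()
    return fields


def matches_language(doc: dict, language: str) -> bool:
    """Compare against the language field only, ignoring case."""
    return doc["language"] == language.lower()


saudi = locale_fields("ar_SA")
# A case-insensitive search for Sanskrit ('sa') does not hit the Saudi
# Arabia country code here, because only the language field is compared.
sanskrit_hits_saudi = matches_language(saudi, "sa")
```

The cost of this approach is that the documents still have to be reshaped at index time, so it does not avoid the manual mangling, it only makes the language and country matches unambiguous.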

My gut feel is that doing this right might require a custom analyzer, but I
am willing to explore other approaches if that is unnecessarily involved.

On Sun, Sep 12, 2010 at 1:53 PM, James Cook jcook@tracermedia.com wrote:

I have a multilingual datastore of documents. [...]

In case someone is trying to help, here is a set of curl commands to set up
and test a couple of queries:

curl -XDELETE 'http://localhost:9200/twitter/tweet/_query?q=user:*'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '
{
"user": "user0",
"locale": "en"
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '
{
"user": "user1",
"locale": "en_US"
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '
{
"user": "user2",
"locale": "en_GB"
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '
{
"user": "user3",
"locale": "ar"
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '
{
"user": "user4",
"locale": "ar_SA"
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '
{
"user": "user5",
"locale": "en"
}'

curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=user:*'

curl -XGET 'http://localhost:9200/twitter/tweet/_search?q=locale:ar&pretty=true'

curl -XGET 'http://localhost:9200/twitter/tweet/_search?q=locale:en&pretty=true'
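
Once the query contains spaces or a boost, for example 'locale:en_GB^2 locale:en', the URL needs escaping, which is easy to get wrong by hand. Below is a small Python sketch for building the same `_search` URL. The host, index, and type are the ones used in the curl commands above; the function name `search_url` is hypothetical and the boosted query is only an example.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200/twitter/tweet/_search"


def search_url(query: str, pretty: bool = True) -> str:
    """Build a URI-search URL with the query percent-encoded."""
    params = {"q": query}
    if pretty:
        params["pretty"] = "true"
    return f"{BASE_URL}?{urlencode(params)}"
```

The resulting string can be passed to curl as-is, for example `curl -XGET "$(python build_url.py)"` if the function is wrapped in a script (the script name is likewise made up).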

Thanks

On Sun, Sep 12, 2010 at 3:28 PM, James Cook jcook@tracermedia.com wrote:

My cursory first attempt to investigate this type of matching seems to
indicate that the out-of-the-box behavior might be close to what I need
after all. [...]