Alternative approaches to a query


(James Cook) #1

I have a multilingual datastore of documents. After filtering by other
criteria, I need to find the document that best matches a particular locale
specified by the user. As an example, assume I have narrowed my choices down
to four documents, each with one of the following locale properties:

  • doc1.locale = 'en_US';
  • doc2.locale = 'ar';
  • doc3.locale = 'en_GB';
  • doc4.locale = 'ar_SA';

I'd like to devise a query that returns the results ranked so that a
specific language/country match beats a match on language alone.

  1. A query for 'en' will score 'en_US' and 'en_GB' the same and higher
    than the rest.
  2. A query for 'en_GB' will score 'en_GB' the highest, 'en_US' next.
  3. A query for 'en_GU' will score 'en_US' and 'en_GB' the same and higher
    than the rest.
  4. A query for 'ar' will score 'ar' the highest, followed by 'ar_SA'.
  5. A query for 'ar_SA' will score 'ar_SA' the highest, followed by 'ar'.
  6. A query for 'ar_LB' will score 'ar' the highest, followed by 'ar_SA'.
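
To pin down the six rules above independently of any search engine, here is a minimal reference sketch in plain Python. The function names and the 3/2/1/0 score values are hypothetical; only the relative ordering matters, and nothing here reflects how the index actually scores documents.

```python
def parse_locale(locale):
    """Split 'en_GB' into ('en', 'GB'); a bare 'en' gives ('en', None)."""
    language, _, country = locale.partition("_")
    return language, (country or None)


def locale_score(query, doc_locale):
    """Hypothetical reference scoring: higher means a better match."""
    q_lang, q_country = parse_locale(query)
    d_lang, d_country = parse_locale(doc_locale)
    if q_lang != d_lang:
        return 0  # different language: no match
    if q_country == d_country:
        return 3  # exact match, e.g. 'en_GB' vs 'en_GB' or 'ar' vs 'ar'
    if d_country is None:
        return 2  # document is the language-only fallback
    return 1      # same language, but a country the query did not ask for


def rank(query, doc_locales):
    """Return the matching locales, best first; non-matches are dropped."""
    scored = [(locale_score(query, d), d) for d in doc_locales]
    # sorted() is stable, so equally scored locales keep their input order.
    return [d for s, d in sorted(scored, key=lambda p: -p[0]) if s > 0]
```

With the four documents above, `rank('ar_LB', ['en_US', 'ar', 'en_GB', 'ar_SA'])` gives `['ar', 'ar_SA']`, which is rule 6.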

I suppose I can achieve this sort order by breaking the locale stored on the
document into two terms when the language and country are included. For
example:

  • doc1.locale = 'en_US en';
  • doc2.locale = 'ar';
  • doc3.locale = 'en_GB en';
  • doc4.locale = 'ar_SA ar';

Then perhaps I can do something similar with the search term, so the term
'en_GB' becomes 'en en_GB' with a boost on the more precise match.
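
A rough sketch of that mangling, on both the index side and the query side. It assumes the locale field is indexed so that each whitespace-separated term stays intact, and that the query goes through Lucene-style query-string syntax, where `^` applies a boost. The helper names and the boost value of 2 are made up for illustration.

```python
def expand_locale(locale):
    """'en_GB' -> ['en_GB', 'en']; a bare 'ar' -> ['ar']."""
    language = locale.split("_", 1)[0]
    return [locale] if language == locale else [locale, language]


def index_value(locale):
    """Value to store in the document's locale field, e.g. 'en_GB en'."""
    return " ".join(expand_locale(locale))


def query_string(locale, field="locale", exact_boost=2):
    """Build a query string that boosts the precise term over the language."""
    terms = expand_locale(locale)
    clauses = ['%s:"%s"' % (field, t) for t in terms]
    if len(terms) > 1:
        # Only the full language_country term gets the (assumed) boost.
        clauses[0] += "^%d" % exact_boost
    return " ".join(clauses)


print(index_value("ar_SA"))   # ar_SA ar
print(query_string("en_GB"))  # locale:"en_GB"^2 locale:"en"
```

This on its own does not guarantee rule 6 ('ar' ahead of 'ar_SA' for a query of 'ar_LB'), since both documents match only on the 'ar' term and the tie is left to the engine's scoring.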

But I was wondering if there was a better way to perform this query than me
manually mangling the search term and the locales stored in each of my
objects.

Thanks for any help.


(James Cook) #2

My cursory first attempt to investigate this type of matching seems to
indicate that the out-of-the-box behavior might be close to what I need
after all. Assuming I leave the search term and the locale properties alone,
I am seeing the matches and scoring I need. For example, if the
locales are ['en', 'en_US', 'en_GB', 'ar', 'ar_SA', 'en'] ('en' appears
twice), the following searches return these results:

  term     results in order
  -----    --------------------
  en       en, en, en_US, en_GB
  en_GB    en_GB
  ar       ar, ar_SA
  ar_SA    ar_SA

So, I guess I got lucky and the out-of-the-box experience is nearly what I
need. I might be getting fooled by my simple test however. I thought that
when I ran an early test for 'en', I wasn't getting the exact match ('en')
to be scored higher than 'en_US', but I can't seem to duplicate that issue.

I would probably want 'en_GB' to match 'en' in the absence of an exact
'en_GB' match. So, adding the language to the search terms when the user
specifies a lang_country term would alleviate the problem. For example, a
search for 'en_GB' would become a search for 'en en_GB'.

The only problem at the moment is that the query can't differentiate between
matching on the language and matching on the country. For example, if the
language I am searching for is Sanskrit (sa), it will match the country code
for Saudi Arabia (also SA, but capitalized). If the searches were case
sensitive, I suppose this could be managed.
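
One way around the sa/SA collision that does not depend on case sensitivity would be to keep the language and the country in separate fields. The sketch below is only an illustration of that idea: the field names `language` and `country` are hypothetical, and the query again assumes Lucene-style query-string syntax (`+` for a required clause, `^` for a boost).

```python
def locale_fields(locale):
    """'ar_SA' -> {'language': 'ar', 'country': 'SA'}; 'sa' -> {'language': 'sa'}."""
    language, _, country = locale.partition("_")
    fields = {"language": language.lower()}
    if country:
        fields["country"] = country.upper()
    return fields


def fielded_query(locale, country_boost=2):
    """Require the language; a matching country only raises the score."""
    fields = locale_fields(locale)
    clauses = ["+language:%s" % fields["language"]]
    if "country" in fields:
        clauses.append("country:%s^%d" % (fields["country"], country_boost))
    return " ".join(clauses)


print(fielded_query("sa"))     # +language:sa
print(fielded_query("ar_SA"))  # +language:ar country:SA^2
```

Because a search for Sanskrit only ever looks at the `language` field, it can no longer hit a document whose country happens to be SA.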

My gut feel is that doing this right might require a custom analyzer, but I
am willing to explore other approaches if that is unnecessarily involved.

On Sun, Sep 12, 2010 at 1:53 PM, James Cook jcook@tracermedia.com wrote:


(James Cook) #3

In case someone is trying to help, here is a set of curl commands to set up
and test a couple of queries:

curl -XDELETE 'http://localhost:9200/twitter/tweet/_query?q=user:*'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '
{
"user": "user0",
"locale": "en"
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '
{
"user": "user1",
"locale": "en_US"
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '
{
"user": "user2",
"locale": "en_GB"
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '
{
"user": "user3",
"locale": "ar"
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '
{
"user": "user4",
"locale": "ar_SA"
}'

curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '
{
"user": "user5",
"locale": "en"
}'

curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=user:*'

curl -XGET 'http://localhost:9200/twitter/tweet/_search?q=locale:ar&pretty=true'

curl -XGET 'http://localhost:9200/twitter/tweet/_search?q=locale:en&pretty=true'

Thanks

On Sun, Sep 12, 2010 at 3:28 PM, James Cook jcook@tracermedia.com wrote:


(system) #4