Moving the search from PostgreSQL to elasticsearch

Hey, I was using PostgreSQL for searching so far since my applications
relies on it anyway. I found a way for scoring which I am pretty happy with
but search times are just inacceptable since it is searching in a table
with about 9 mio rows.

I started with elasticsearch just a few days ago in a PHP Project using the
elastica client. I indexed about a million documents with book titles and
some related information using the jdbc plugin but don't know if the
scoring I want is possible at all with elasticsearch. One maybe rather
unusual requirement is to different scoring for different users depending
on their language skills.

I have the following index of books which different editions each:

title | author | ratings | bookid | editionid

languageid
Hamlet
eng
Hamlet by Shakespeare
eng
Hamlet bilingual
eng
Romeo and Juliet
eng
Romeo und Julia
ger
Roméo et Juliette
fre
Othello
eng

I want only one title of each bookid in the results - the one with the
highest score which is calculated as follows:

Take the highest ratings count of all results and call it max_r.
Put a weight on the language - 100 for the uses first language, 5 for his
second spoken language, 3 for the third and 2 for the forth (in case
someone is so blessed with language skills) and call it lang_weight. I have
the information of languages for logged in users.

score = weight * 100/max_r * ratings

Is there a way to get the max_r during the query and use it like in sql?
How can I tell elasticsearch to use the weight based on the languageid?
And I still would like to inegrate fuzzy search.

Here some examples of what I would like to get. Depending on the user I
would get following results when searching:

english user searching for 'hamlet':

  1. editionid 1
    only one result since the bookid shall be unique in a result set and
    edition 1 has the highest order, score = 100 * 0.036 * 2784 = 10,000
    editionid 2 has the score = 100 * 0.036 * 71 = 255
    editionid 3 has the score = 100 * 0.036 * 144 = 517

german user (with second language eng and third fre) searching for 'romeo &
shakespeare':

  1. editionid 5
    because in his first language is ger, score = 100 * 0.068 * 325 = 2211
    editionid 4 has the score = 5 * 0.068 * 1470 = 500
    editionid 6 has the score = 3 * 0.068 * 487 = 99

spanish user (with second language german, third french, forth english)
searching for 'Shakespeare':

  1. edition 1
    score = 2 * 0.036 * 2784 = 200
  2. edition 7
    score = 2 * 0.036 * 1548 = 111
  3. edition 5
    score = 5 * 0.036 * 325 = 58

I would be very happy if you could point me to the right direction since
this is all quite confusing for me so far.
Thanks a lot.

--

sorry for the table layout - where is the preview function?

title | author |
ratings| bookid | editionid | languageid

Hamlet |William Shakespeare | 2784 | 1 |
1 | eng
Hamlet by Shakespeare |William Shakespeare | 71 | 1 | 2
| eng
Hamlet bilingual |William Shakespeare | 144 | 1 | 3
| eng
Romeo and Juliet |William Shakespeare | 1470 | 2 | 4
| eng
Romeo und Julia |William Shakespeare | 325 | 2 | 5
| ger
Roméo et Juliette |William Shakespeare | 487 | 2 | 6
| fre
Othello |William Shakespeare | 1548 | 3
| 7 | eng

Am Freitag, 11. Januar 2013 01:57:00 UTC+1 schrieb randel:

Hey, I was using PostgreSQL for searching so far since my applications
relies on it anyway. I found a way for scoring which I am pretty happy with
but search times are just inacceptable since it is searching in a table
with about 9 mio rows.

I started with elasticsearch just a few days ago in a PHP Project using
the elastica client. I indexed about a million documents with book titles
and some related information using the jdbc plugin but don't know if the
scoring I want is possible at all with elasticsearch. One maybe rather
unusual requirement is to different scoring for different users depending
on their language skills.

I have the following index of books which different editions each:

title | author | ratings | bookid |
editionid | languageid


Hamlet |William Shakespeare | 2784 | 1 | 1
| eng
Hamlet by Shakespeare |William Shakespeare | 71 | 1 | 2
| eng
Hamlet bilingual |William Shakespeare | 144 | 1 | 3
| eng
Romeo and Juliet |William Shakespeare | 1470 | 2 | 4
| eng
Romeo und Julia |William Shakespeare | 325 | 2 | 5
| ger
Roméo et Juliette |William Shakespeare | 487 | 2 | 6
| fre
Othello |William Shakespeare | 1548 | 3 | 7
| eng

I want only one title of each bookid in the results - the one with the
highest score which is calculated as follows:

Take the highest ratings count of all results and call it max_r.
Put a weight on the language - 100 for the uses first language, 5 for his
second spoken language, 3 for the third and 2 for the forth (in case
someone is so blessed with language skills) and call it lang_weight. I have
the information of languages for logged in users.

score = weight * 100/max_r * ratings

Is there a way to get the max_r during the query and use it like in sql?
How can I tell elasticsearch to use the weight based on the languageid?
And I still would like to inegrate fuzzy search.

Here some examples of what I would like to get. Depending on the user I
would get following results when searching:

english user searching for 'hamlet':

  1. editionid 1
    only one result since the bookid shall be unique in a result set and
    edition 1 has the highest order, score = 100 * 0.036 * 2784 = 10,000
    editionid 2 has the score = 100 * 0.036 * 71 = 255
    editionid 3 has the score = 100 * 0.036 * 144 = 517

german user (with second language eng and third fre) searching for 'romeo
& shakespeare':

  1. editionid 5
    because in his first language is ger, score = 100 * 0.068 * 325 = 2211
    editionid 4 has the score = 5 * 0.068 * 1470 = 500
    editionid 6 has the score = 3 * 0.068 * 487 = 99

spanish user (with second language german, third french, forth english)
searching for 'Shakespeare':

  1. edition 1
    score = 2 * 0.036 * 2784 = 200
  2. edition 7
    score = 2 * 0.036 * 1548 = 111
  3. edition 5
    score = 5 * 0.036 * 325 = 58

I would be very happy if you could point me to the right direction since
this is all quite confusing for me so far.
Thanks a lot.

--

Hello,

I think you can use the Custom Score query for that:
http://www.elasticsearch.org/guide/reference/query-dsl/custom-score-query.html

The way I'll do it is to compute the score as weight/ratings (where you'd
have to compute weight on its own within the custom score script). This
should give you the same scoring order as your original formula (because
max_r and 100 are constants within the same query).

If you need to show the score as your original formula, I think it's
cheaper to do that in your application. To get the highest rating I'd do a
second query, just to get the highest rating. I'd just use filters[0] and
sorting[1] to get that - in order to maximize the use of caches, and set
size=1 to get only the first result.

[0] http://www.elasticsearch.org/guide/reference/query-dsl/query-filter.html
--> please note (see the bottom of the page) that this filter isn't cached
by default
[1] http://www.elasticsearch.org/guide/reference/api/search/sort.html

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Fri, Jan 11, 2013 at 2:57 AM, randel mandavister@gmail.com wrote:

Hey, I was using PostgreSQL for searching so far since my applications
relies on it anyway. I found a way for scoring which I am pretty happy with
but search times are just inacceptable since it is searching in a table
with about 9 mio rows.

I started with elasticsearch just a few days ago in a PHP Project using
the elastica client. I indexed about a million documents with book titles
and some related information using the jdbc plugin but don't know if the
scoring I want is possible at all with elasticsearch. One maybe rather
unusual requirement is to different scoring for different users depending
on their language skills.

I have the following index of books which different editions each:

title | author | ratings | bookid |
editionid | languageid


Hamlet |William Shakespeare | 2784 | 1 | 1
| eng
Hamlet by Shakespeare |William Shakespeare | 71 | 1 | 2
| eng
Hamlet bilingual |William Shakespeare | 144 | 1 | 3
| eng
Romeo and Juliet |William Shakespeare | 1470 | 2 | 4
| eng
Romeo und Julia |William Shakespeare | 325 | 2 | 5
| ger
Roméo et Juliette |William Shakespeare | 487 | 2 | 6
| fre
Othello |William Shakespeare | 1548 | 3 | 7
| eng

I want only one title of each bookid in the results - the one with the
highest score which is calculated as follows:

Take the highest ratings count of all results and call it max_r.
Put a weight on the language - 100 for the uses first language, 5 for his
second spoken language, 3 for the third and 2 for the forth (in case
someone is so blessed with language skills) and call it lang_weight. I have
the information of languages for logged in users.

score = weight * 100/max_r * ratings

Is there a way to get the max_r during the query and use it like in sql?
How can I tell elasticsearch to use the weight based on the languageid?
And I still would like to inegrate fuzzy search.

Here some examples of what I would like to get. Depending on the user I
would get following results when searching:

english user searching for 'hamlet':

  1. editionid 1
    only one result since the bookid shall be unique in a result set and
    edition 1 has the highest order, score = 100 * 0.036 * 2784 = 10,000
    editionid 2 has the score = 100 * 0.036 * 71 = 255
    editionid 3 has the score = 100 * 0.036 * 144 = 517

german user (with second language eng and third fre) searching for 'romeo
& shakespeare':

  1. editionid 5
    because in his first language is ger, score = 100 * 0.068 * 325 = 2211
    editionid 4 has the score = 5 * 0.068 * 1470 = 500
    editionid 6 has the score = 3 * 0.068 * 487 = 99

spanish user (with second language german, third french, forth english)
searching for 'Shakespeare':

  1. edition 1
    score = 2 * 0.036 * 2784 = 200
  2. edition 7
    score = 2 * 0.036 * 1548 = 111
  3. edition 5
    score = 5 * 0.036 * 325 = 58

I would be very happy if you could point me to the right direction since
this is all quite confusing for me so far.
Thanks a lot.

--

--

Hey Radu,

thanks for your reply. I used an extra query to find the max_value just before realising that it doesn't change the order of scores. Now I use a custom script in order to get the language-dependend scoring:

$sort = array(
"_script" => array(
'script' => " if(doc['languageid'].value==".$userlang[0]."){(doc['bookratings'].value * 100) + (0.9/doc['case'].value);}
else if(doc['languageid'].value==".$userlang[1]."){(doc['bookratings'].value * 5) + (0.9/doc['case'].value);}
else if(doc['languageid'].value==".$userlang[2]."){(doc['bookratings'].value * 3) + (0.9/doc['case'].value);}
else if(doc['languageid'].value==".$userlang[3]."){(doc['bookratings'].value * 2) + (0.9/doc['case'].value);}
else {doc['languageid'].value + (0.9/doc['case'].value);}",
'type' => 'number',
'order' => 'desc'),
);

That seems to work ok, even though it feels a little slower. And to have only the best fitting title shown of all editions of one book I wrote something in the application. It makes pagination more complicated but the grouping functionality isn't ready yet for elasticsearch. Even though it's implemented in Lucene already.

Regards,
Randel

Hey Radu,

thanks for your reply. I used an extra query to find the max_value
just before realising that it doesn't change the order of scores. Now
I use a custom script in order to get the language-dependend scoring:

$sort = array(
"_script" => array(
'script' => " if(doc['languageid'].value==".
$userlang[0]."){(doc['bookratings'].value * 100) + (0.9/
doc['case'].value);}
else if(doc['languageid'].value==".$userlang[1].")
{(doc['bookratings'].value * 5) + (0.9/doc['case'].value);}
else if(doc['languageid'].value==".$userlang[2].")
{(doc['bookratings'].value * 3) + (0.9/doc['case'].value);}
else if(doc['languageid'].value==".$userlang[3].")
{(doc['bookratings'].value * 2) + (0.9/doc['case'].value);}
else {doc['languageid'].value + (0.9/
doc['case'].value);}",
'type' => 'number',
'order' => 'desc'),
);

That seems to work ok, even though it feels a little slower. And to
have only the best fitting title shown of all editions of one book I
wrote something in the application. It makes pagination more
complicated but the grouping functionality isn't ready yet for
elasticsearch. Even though it's implemented in Lucene already.

Regards,
Randel

On 14 Jan., 18:33, Radu Gheorghe radu.gheor...@sematext.com wrote:

Hello,

I think you can use the Custom Score query for that:http://www.elasticsearch.org/guide/reference/query-dsl/custom-score-q...

The way I'll do it is to compute the score as weight/ratings (where you'd
have to compute weight on its own within the custom score script). This
should give you the same scoring order as your original formula (because
max_r and 100 are constants within the same query).

If you need to show the score as your original formula, I think it's
cheaper to do that in your application. To get the highest rating I'd do a
second query, just to get the highest rating. I'd just use filters[0] and
sorting[1] to get that - in order to maximize the use of caches, and set
size=1 to get only the first result.

[0]http://www.elasticsearch.org/guide/reference/query-dsl/query-filter.html
--> please note (see the bottom of the page) that this filter isn't cached
by default
[1]http://www.elasticsearch.org/guide/reference/api/search/sort.html

Best regards,
Radu
--http://sematext.com/-- ElasticSearch -- Solr -- Lucene

On Fri, Jan 11, 2013 at 2:57 AM, randel mandavis...@gmail.com wrote:

Hey, I was using PostgreSQL for searching so far since my applications
relies on it anyway. I found a way for scoring which I am pretty happy with
but search times are just inacceptable since it is searching in a table
with about 9 mio rows.

I started with elasticsearch just a few days ago in a PHP Project using
the elastica client. I indexed about a million documents with book titles
and some related information using the jdbc plugin but don't know if the
scoring I want is possible at all with elasticsearch. One maybe rather
unusual requirement is to different scoring for different users depending
on their language skills.

I have the following index of books which different editions each:

title | author | ratings | bookid |
editionid | languageid


Hamlet |William Shakespeare | 2784 | 1 | 1
| eng
Hamlet by Shakespeare |William Shakespeare | 71 | 1 | 2
| eng
Hamlet bilingual |William Shakespeare | 144 | 1 | 3
| eng
Romeo and Juliet |William Shakespeare | 1470 | 2 | 4
| eng
Romeo und Julia |William Shakespeare | 325 | 2 | 5
| ger
Roméo et Juliette |William Shakespeare | 487 | 2 | 6
| fre
Othello |William Shakespeare | 1548 | 3 | 7
| eng

I want only one title of each bookid in the results - the one with the
highest score which is calculated as follows:

Take the highest ratings count of all results and call it max_r.
Put a weight on the language - 100 for the uses first language, 5 for his
second spoken language, 3 for the third and 2 for the forth (in case
someone is so blessed with language skills) and call it lang_weight. I have
the information of languages for logged in users.

score = weight * 100/max_r * ratings

Is there a way to get the max_r during the query and use it like in sql?
How can I tell elasticsearch to use the weight based on the languageid?
And I still would like to inegrate fuzzy search.

Here some examples of what I would like to get. Depending on the user I
would get following results when searching:

english user searching for 'hamlet':

  1. editionid 1
    only one result since the bookid shall be unique in a result set and
    edition 1 has the highest order, score = 100 * 0.036 * 2784 = 10,000
    editionid 2 has the score = 100 * 0.036 * 71 = 255
    editionid 3 has the score = 100 * 0.036 * 144 = 517

german user (with second language eng and third fre) searching for 'romeo
& shakespeare':

  1. editionid 5
    because in his first language is ger, score = 100 * 0.068 * 325 = 2211
    editionid 4 has the score = 5 * 0.068 * 1470 = 500
    editionid 6 has the score = 3 * 0.068 * 487 = 99

spanish user (with second language german, third french, forth english)
searching for 'Shakespeare':

  1. edition 1
    score = 2 * 0.036 * 2784 = 200
  2. edition 7
    score = 2 * 0.036 * 1548 = 111
  3. edition 5
    score = 5 * 0.036 * 325 = 58

I would be very happy if you could point me to the right direction since
this is all quite confusing for me so far.
Thanks a lot.

--

--