Moving the search from PostgreSQL to elasticsearch

randel_2 · January 11, 2013, 12:57am

Hey, I was using PostgreSQL for searching so far since my applications
relies on it anyway. I found a way for scoring which I am pretty happy with
but search times are just inacceptable since it is searching in a table
with about 9 mio rows.

I started with elasticsearch just a few days ago in a PHP Project using the
elastica client. I indexed about a million documents with book titles and
some related information using the jdbc plugin but don't know if the
scoring I want is possible at all with elasticsearch. One maybe rather
unusual requirement is to different scoring for different users depending
on their language skills.

I have the following index of books which different editions each:

title | author | ratings | bookid | editionid

languageid
Hamlet
eng
Hamlet by Shakespeare
eng
Hamlet bilingual
eng
Romeo and Juliet
eng
Romeo und Julia
ger
Roméo et Juliette
fre
Othello
eng

I want only one title of each bookid in the results - the one with the
highest score which is calculated as follows:

Take the highest ratings count of all results and call it max_r.
Put a weight on the language - 100 for the uses first language, 5 for his
second spoken language, 3 for the third and 2 for the forth (in case
someone is so blessed with language skills) and call it lang_weight. I have
the information of languages for logged in users.

score = weight * 100/max_r * ratings

Is there a way to get the max_r during the query and use it like in sql?
How can I tell elasticsearch to use the weight based on the languageid?
And I still would like to inegrate fuzzy search.

Here some examples of what I would like to get. Depending on the user I
would get following results when searching:

english user searching for 'hamlet':

editionid 1
only one result since the bookid shall be unique in a result set and
edition 1 has the highest order, score = 100 * 0.036 * 2784 = 10,000
editionid 2 has the score = 100 * 0.036 * 71 = 255
editionid 3 has the score = 100 * 0.036 * 144 = 517

german user (with second language eng and third fre) searching for 'romeo &
shakespeare':

editionid 5
because in his first language is ger, score = 100 * 0.068 * 325 = 2211
editionid 4 has the score = 5 * 0.068 * 1470 = 500
editionid 6 has the score = 3 * 0.068 * 487 = 99

spanish user (with second language german, third french, forth english)
searching for 'Shakespeare':

edition 1
score = 2 * 0.036 * 2784 = 200
edition 7
score = 2 * 0.036 * 1548 = 111
edition 5
score = 5 * 0.036 * 325 = 58

I would be very happy if you could point me to the right direction since
this is all quite confusing for me so far.
Thanks a lot.

--

randel_2 · January 11, 2013, 1:02am

sorry for the table layout - where is the preview function?

title | author |
ratings| bookid | editionid | languageid

Hamlet |William Shakespeare | 2784 | 1 |
1 | eng
Hamlet by Shakespeare |William Shakespeare | 71 | 1 | 2
| eng
Hamlet bilingual |William Shakespeare | 144 | 1 | 3
| eng
Romeo and Juliet |William Shakespeare | 1470 | 2 | 4
| eng
Romeo und Julia |William Shakespeare | 325 | 2 | 5
| ger
Roméo et Juliette |William Shakespeare | 487 | 2 | 6
| fre
Othello |William Shakespeare | 1548 | 3
| 7 | eng

Am Freitag, 11. Januar 2013 01:57:00 UTC+1 schrieb randel:

Hey, I was using PostgreSQL for searching so far since my applications
relies on it anyway. I found a way for scoring which I am pretty happy with
but search times are just inacceptable since it is searching in a table
with about 9 mio rows.

I started with elasticsearch just a few days ago in a PHP Project using
the elastica client. I indexed about a million documents with book titles
and some related information using the jdbc plugin but don't know if the
scoring I want is possible at all with elasticsearch. One maybe rather
unusual requirement is to different scoring for different users depending
on their language skills.

I have the following index of books which different editions each:

title | author | ratings | bookid |
editionid | languageid

Hamlet |William Shakespeare | 2784 | 1 | 1
| eng
Hamlet by Shakespeare |William Shakespeare | 71 | 1 | 2
| eng
Hamlet bilingual |William Shakespeare | 144 | 1 | 3
| eng
Romeo and Juliet |William Shakespeare | 1470 | 2 | 4
| eng
Romeo und Julia |William Shakespeare | 325 | 2 | 5
| ger
Roméo et Juliette |William Shakespeare | 487 | 2 | 6
| fre
Othello |William Shakespeare | 1548 | 3 | 7
| eng

I want only one title of each bookid in the results - the one with the
highest score which is calculated as follows:

Take the highest ratings count of all results and call it max_r.
Put a weight on the language - 100 for the uses first language, 5 for his
second spoken language, 3 for the third and 2 for the forth (in case
someone is so blessed with language skills) and call it lang_weight. I have
the information of languages for logged in users.

score = weight * 100/max_r * ratings

Is there a way to get the max_r during the query and use it like in sql?
How can I tell elasticsearch to use the weight based on the languageid?
And I still would like to inegrate fuzzy search.

Here some examples of what I would like to get. Depending on the user I
would get following results when searching:

english user searching for 'hamlet':

editionid 1
only one result since the bookid shall be unique in a result set and
edition 1 has the highest order, score = 100 * 0.036 * 2784 = 10,000
editionid 2 has the score = 100 * 0.036 * 71 = 255
editionid 3 has the score = 100 * 0.036 * 144 = 517

german user (with second language eng and third fre) searching for 'romeo
& shakespeare':

editionid 5
because in his first language is ger, score = 100 * 0.068 * 325 = 2211
editionid 4 has the score = 5 * 0.068 * 1470 = 500
editionid 6 has the score = 3 * 0.068 * 487 = 99

spanish user (with second language german, third french, forth english)
searching for 'Shakespeare':

edition 1
score = 2 * 0.036 * 2784 = 200

edition 7
score = 2 * 0.036 * 1548 = 111

edition 5
score = 5 * 0.036 * 325 = 58

I would be very happy if you could point me to the right direction since
this is all quite confusing for me so far.
Thanks a lot.

--

radu_gheorghe · January 14, 2013, 5:33pm

Hello,

I think you can use the Custom Score query for that:

The way I'll do it is to compute the score as weight/ratings (where you'd
have to compute weight on its own within the custom score script). This
should give you the same scoring order as your original formula (because
max_r and 100 are constants within the same query).

If you need to show the score as your original formula, I think it's
cheaper to do that in your application. To get the highest rating I'd do a
second query, just to get the highest rating. I'd just use filters[0] and
sorting[1] to get that - in order to maximize the use of caches, and set
size=1 to get only the first result.

[0] Elastic — The Search AI Company | Elastic
--> please note (see the bottom of the page) that this filter isn't cached
by default
[1] Elastic — The Search AI Company | Elastic

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

On Fri, Jan 11, 2013 at 2:57 AM, randel mandavister@gmail.com wrote:

Hey, I was using PostgreSQL for searching so far since my applications
relies on it anyway. I found a way for scoring which I am pretty happy with
but search times are just inacceptable since it is searching in a table
with about 9 mio rows.

I started with elasticsearch just a few days ago in a PHP Project using
the elastica client. I indexed about a million documents with book titles
and some related information using the jdbc plugin but don't know if the
scoring I want is possible at all with elasticsearch. One maybe rather
unusual requirement is to different scoring for different users depending
on their language skills.

I have the following index of books which different editions each:

title | author | ratings | bookid |
editionid | languageid

Hamlet |William Shakespeare | 2784 | 1 | 1
| eng
Hamlet by Shakespeare |William Shakespeare | 71 | 1 | 2
| eng
Hamlet bilingual |William Shakespeare | 144 | 1 | 3
| eng
Romeo and Juliet |William Shakespeare | 1470 | 2 | 4
| eng
Romeo und Julia |William Shakespeare | 325 | 2 | 5
| ger
Roméo et Juliette |William Shakespeare | 487 | 2 | 6
| fre
Othello |William Shakespeare | 1548 | 3 | 7
| eng

I want only one title of each bookid in the results - the one with the
highest score which is calculated as follows:

Take the highest ratings count of all results and call it max_r.
Put a weight on the language - 100 for the uses first language, 5 for his
second spoken language, 3 for the third and 2 for the forth (in case
someone is so blessed with language skills) and call it lang_weight. I have
the information of languages for logged in users.

score = weight * 100/max_r * ratings

Is there a way to get the max_r during the query and use it like in sql?
How can I tell elasticsearch to use the weight based on the languageid?
And I still would like to inegrate fuzzy search.

Here some examples of what I would like to get. Depending on the user I
would get following results when searching:

english user searching for 'hamlet':

editionid 1
only one result since the bookid shall be unique in a result set and
edition 1 has the highest order, score = 100 * 0.036 * 2784 = 10,000
editionid 2 has the score = 100 * 0.036 * 71 = 255
editionid 3 has the score = 100 * 0.036 * 144 = 517

german user (with second language eng and third fre) searching for 'romeo
& shakespeare':

editionid 5
because in his first language is ger, score = 100 * 0.068 * 325 = 2211
editionid 4 has the score = 5 * 0.068 * 1470 = 500
editionid 6 has the score = 3 * 0.068 * 487 = 99

spanish user (with second language german, third french, forth english)
searching for 'Shakespeare':

edition 1
score = 2 * 0.036 * 2784 = 200

edition 7
score = 2 * 0.036 * 1548 = 111

edition 5
score = 5 * 0.036 * 325 = 58

I would be very happy if you could point me to the right direction since
this is all quite confusing for me so far.
Thanks a lot.

--

--

randel · January 15, 2013, 4:25pm

Hey Radu,

thanks for your reply. I used an extra query to find the max_value just before realising that it doesn't change the order of scores. Now I use a custom script in order to get the language-dependend scoring:

$sort = array(
"_script" => array(
'script' => " if(doc['languageid'].value==".$userlang[0]."){(doc['bookratings'].value * 100) + (0.9/doc['case'].value);}
else if(doc['languageid'].value==".$userlang[1]."){(doc['bookratings'].value * 5) + (0.9/doc['case'].value);}
else if(doc['languageid'].value==".$userlang[2]."){(doc['bookratings'].value * 3) + (0.9/doc['case'].value);}
else if(doc['languageid'].value==".$userlang[3]."){(doc['bookratings'].value * 2) + (0.9/doc['case'].value);}
else {doc['languageid'].value + (0.9/doc['case'].value);}",
'type' => 'number',
'order' => 'desc'),
);

That seems to work ok, even though it feels a little slower. And to have only the best fitting title shown of all editions of one book I wrote something in the application. It makes pagination more complicated but the grouping functionality isn't ready yet for elasticsearch. Even though it's implemented in Lucene already.

Regards,
Randel

randel_2 · January 15, 2013, 4:26pm

Hey Radu,

thanks for your reply. I used an extra query to find the max_value
just before realising that it doesn't change the order of scores. Now
I use a custom script in order to get the language-dependend scoring:

$sort = array(
"_script" => array(
'script' => " if(doc['languageid'].value==".
$userlang[0]."){(doc['bookratings'].value * 100) + (0.9/
doc['case'].value);}
else if(doc['languageid'].value==".$userlang[1].")
{(doc['bookratings'].value * 5) + (0.9/doc['case'].value);}
else if(doc['languageid'].value==".$userlang[2].")
{(doc['bookratings'].value * 3) + (0.9/doc['case'].value);}
else if(doc['languageid'].value==".$userlang[3].")
{(doc['bookratings'].value * 2) + (0.9/doc['case'].value);}
else {doc['languageid'].value + (0.9/
doc['case'].value);}",
'type' => 'number',
'order' => 'desc'),
);

That seems to work ok, even though it feels a little slower. And to
have only the best fitting title shown of all editions of one book I
wrote something in the application. It makes pagination more
complicated but the grouping functionality isn't ready yet for
elasticsearch. Even though it's implemented in Lucene already.

Regards,
Randel

On 14 Jan., 18:33, Radu Gheorghe radu.gheor...@sematext.com wrote:

Hello,

I think you can use the Custom Score query for that:Elasticsearch Platform — Find real-time answers at scale | Elastic...

The way I'll do it is to compute the score as weight/ratings (where you'd
have to compute weight on its own within the custom score script). This
should give you the same scoring order as your original formula (because
max_r and 100 are constants within the same query).

If you need to show the score as your original formula, I think it's
cheaper to do that in your application. To get the highest rating I'd do a
second query, just to get the highest rating. I'd just use filters[0] and
sorting[1] to get that - in order to maximize the use of caches, and set
size=1 to get only the first result.

[0]Elasticsearch Platform — Find real-time answers at scale | Elastic
--> please note (see the bottom of the page) that this filter isn't cached
by default
[1]Elasticsearch Platform — Find real-time answers at scale | Elastic

Best regards,
Radu
--http://sematext.com/-- Elasticsearch -- Solr -- Lucene

On Fri, Jan 11, 2013 at 2:57 AM, randel mandavis...@gmail.com wrote:

Hey, I was using PostgreSQL for searching so far since my applications
relies on it anyway. I found a way for scoring which I am pretty happy with
but search times are just inacceptable since it is searching in a table
with about 9 mio rows.

I started with elasticsearch just a few days ago in a PHP Project using
the elastica client. I indexed about a million documents with book titles
and some related information using the jdbc plugin but don't know if the
scoring I want is possible at all with elasticsearch. One maybe rather
unusual requirement is to different scoring for different users depending
on their language skills.

I have the following index of books which different editions each:

title | author | ratings | bookid |
editionid | languageid

Hamlet |William Shakespeare | 2784 | 1 | 1
| eng
Hamlet by Shakespeare |William Shakespeare | 71 | 1 | 2
| eng
Hamlet bilingual |William Shakespeare | 144 | 1 | 3
| eng
Romeo and Juliet |William Shakespeare | 1470 | 2 | 4
| eng
Romeo und Julia |William Shakespeare | 325 | 2 | 5
| ger
Roméo et Juliette |William Shakespeare | 487 | 2 | 6
| fre
Othello |William Shakespeare | 1548 | 3 | 7
| eng

I want only one title of each bookid in the results - the one with the
highest score which is calculated as follows:

Take the highest ratings count of all results and call it max_r.
Put a weight on the language - 100 for the uses first language, 5 for his
second spoken language, 3 for the third and 2 for the forth (in case
someone is so blessed with language skills) and call it lang_weight. I have
the information of languages for logged in users.

score = weight * 100/max_r * ratings

Is there a way to get the max_r during the query and use it like in sql?
How can I tell elasticsearch to use the weight based on the languageid?
And I still would like to inegrate fuzzy search.

Here some examples of what I would like to get. Depending on the user I
would get following results when searching:

english user searching for 'hamlet':

editionid 1
only one result since the bookid shall be unique in a result set and
edition 1 has the highest order, score = 100 * 0.036 * 2784 = 10,000
editionid 2 has the score = 100 * 0.036 * 71 = 255
editionid 3 has the score = 100 * 0.036 * 144 = 517

german user (with second language eng and third fre) searching for 'romeo
& shakespeare':

editionid 5
because in his first language is ger, score = 100 * 0.068 * 325 = 2211
editionid 4 has the score = 5 * 0.068 * 1470 = 500
editionid 6 has the score = 3 * 0.068 * 487 = 99

spanish user (with second language german, third french, forth english)
searching for 'Shakespeare':

edition 1
score = 2 * 0.036 * 2784 = 200

edition 7
score = 2 * 0.036 * 1548 = 111

edition 5
score = 5 * 0.036 * 325 = 58

I would be very happy if you could point me to the right direction since
this is all quite confusing for me so far.
Thanks a lot.

--

--

Topic		Replies	Views
Score depending on position in the term on the field Elasticsearch	8	4211	July 6, 2017
Elasticsearch/Lucene scoring broken? Elasticsearch	11	470	July 6, 2017
Using elastic search with per-user ordering of results Elasticsearch	13	873	July 6, 2017
About score of a document Elasticsearch	7	375	July 6, 2017
How is it calculated _score Elasticsearch	5	450	July 6, 2017

Moving the search from PostgreSQL to elasticsearch

title | author | ratings| bookid | editionid | languageid

Best regards, Radu

Related topics

title | author |
ratings| bookid | editionid | languageid

Best regards,
Radu