Per-user ordering of search results

I have an Elasticsearch setup where I would like to add per-user ordering,
i.e. every user gets different search results depending on how they've
interacted with the documents. We have a score table in MySQL with a row
per document per user, and we would like to sort search results by the
specific user's scores.

We've investigated adding a field per user to each document, named
"_score_{user_id}", e.g. "_score_5327" for the user with id 5327. Querying
Elasticsearch on behalf of that user then requires specifying "sort": {
"_score_5327": { "order": "desc", "ignore_unmapped": true } }. By keeping the
per-user score on the root document, we sidestep the problem of not being
able to sort on nested document fields. We can keep the scores up to date
with the partial update API
(http://www.elasticsearch.org/guide/reference/api/update.html).
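
To make the field-per-user approach concrete, here is a minimal Python sketch of the two request bodies it implies. The helper names are hypothetical, and the "doc" form of the partial update body is an assumption about how the update API is being called; only the field name pattern and the sort options come from the description above.

```python
import json


def score_field(user_id: int) -> str:
    """Name of the per-user score field, e.g. "_score_5327"."""
    return f"_score_{user_id}"


def sort_clause(user_id: int) -> dict:
    """Sort descending on one user's score field, tolerating unmapped fields."""
    return {score_field(user_id): {"order": "desc", "ignore_unmapped": True}}


def partial_update_body(user_id: int, score: float) -> dict:
    """Body for a partial update that sets one user's score on a document.

    Assumption: the update is sent as a partial document under "doc".
    """
    return {"doc": {score_field(user_id): score}}


if __name__ == "__main__":
    print(json.dumps({"sort": sort_clause(5327)}))
    print(json.dumps(partial_update_body(5327, 12.5)))
```

Note that every distinct user id adds a new field name to the index, so the number of fields grows with the number of users.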

The approach works well in development, but when we rebuild our index in
staging with a lot more data, ES falls over after many long GC pauses
followed by a java.lang.OutOfMemoryError: Java heap space. Do the extra
fields cause ES to fail? Can ES not handle this many extra fields (~3,000
scores for the most popular document)? What are the alternatives?

Thanks,
Andreas

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

You're going to run into problems creating a field per user. And in
version 0.90, you can sort on nested fields (including multi-value fields).
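
As a rough illustration of the nested alternative, the sketch below builds a document and a sort clause in Python. The field names ("user_scores", "user_id", "score") and the helper names are hypothetical, and the "nested_path" / "nested_filter" / "mode" options are the names as I recall them from the 0.90-era sort documentation, so check them against the version you run.

```python
def document_with_scores(name: str, scores: dict) -> dict:
    """Build a document that stores per-user scores as nested objects.

    Assumption: "user_scores" is mapped with type "nested" in the index.
    """
    return {
        "name": name,
        "user_scores": [
            {"user_id": user_id, "score": score}
            for user_id, score in scores.items()
        ],
    }


def nested_sort_clause(user_id: int) -> dict:
    """Sort by the score of the nested entry belonging to one user."""
    return {
        "user_scores.score": {
            "order": "desc",
            "mode": "max",
            "nested_path": "user_scores",
            "nested_filter": {"term": {"user_scores.user_id": user_id}},
        }
    }


if __name__ == "__main__":
    print(document_with_scores("Great Document", {5327: 12.5, 42: 3.0}))
    print({"sort": [nested_sort_clause(5327)]})
```

With this layout the mapping contains a fixed set of fields no matter how many users have scores.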

The GC and OOM may or may not be related. You'd need to tell us more about
what you're doing to diagnose the issue there.

clint

My product does something similar for searching events. If you're willing
to sacrifice resolution on the score, you can get really good performance
by keeping, on each document, a list of the ids of users who performed each
kind of action. Then at query time you can boost on a term query.

Example doc:
name: Great Document
contributed: [3,4,6,7]
friend_of_author: [4,5,34,543,773,888]

Then if user 4 performs a query, you can add a "should" clause to your bool
query:
"should":[{"term":{"contributed":{"term":4, "boost":4}}},
{"term":{"friend_of_author":{"term":4, "boost":2}}} ]

FYI: I have on average 1K-10K user IDs added to every document and there
is negligible overhead compared to the rest of the query.
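
A minimal Python sketch of how those "should" clauses could be generated per user. The function names and the wrapping "must" clause are assumptions for illustration; the term clause shape mirrors the example above.

```python
def personalized_should(user_id: int, boosts: dict) -> list:
    """One boosted term clause per action field, matching on the user's id."""
    return [
        {"term": {field: {"term": user_id, "boost": boost}}}
        for field, boost in boosts.items()
    ]


def personalize(base_query: dict, user_id: int, boosts: dict) -> dict:
    """Wrap a base query in a bool query with per-user boosting clauses.

    Documents must match base_query; matching a "should" clause only
    raises the score of documents the user has interacted with.
    """
    return {
        "bool": {
            "must": [base_query],
            "should": personalized_should(user_id, boosts),
        }
    }


if __name__ == "__main__":
    boosts = {"contributed": 4, "friend_of_author": 2}
    base = {"match_all": {}}  # placeholder for the actual search query
    print(personalize(base, 4, boosts))
```

The boosts are fixed per action type, which is the loss of score resolution mentioned above: every user in a list gets the same boost for that action.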

Thanks for the suggestion, Clint. What is the advantage of using a nested
document compared to my current approach (fields directly on the document)?

Regarding GC/OOM: if I run the exact same re-indexing script without
including the user scores, it runs fine, so I'm fairly certain that it's
related. I've only tested this in our staging environment, where we run a
single node with a single index, ~45 million documents (~30GB) and 6 doc
types. The machine has 8GB RAM, of which half is allocated to ES. I'd be
happy to provide more details if that would help.

That's a great suggestion, thanks! Do you also handle actions that can
occur multiple times? E.g. commenting on a document ten times should score
higher than commenting twice.
