Collaborative Filtering

Hi everybody,

Disclaimer: I have very little experience with Elasticsearch.

I'd like to implement collaborative filtering along the lines of a short example by Trey Grainger (slides 28 ff.), in which he recommends books to potential buyers.
Following that example, suppose we want recommendations for user 5 and the purchases are stored with this mapping:

{
  "mappings": {
    "purchase": {
      "properties": {
        "book_id": { "type": "integer" },
        "user_id": { "type": "integer" }
      }
    }
  }
}
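To illustrate the layout: with this mapping, every purchase is its own document holding one `user_id` and one `book_id`. The Python sketch below builds a bulk request body for the `purchases` index and `purchase` type that my query further down targets. The `bulk_body` helper and the sample pairs are my own illustration, not data from the slides.

```python
import json

# Hypothetical purchases as (user_id, book_id) pairs; not data from the slides.
PURCHASES = [(1, 1), (1, 2), (4, 1), (4, 4), (5, 1), (5, 4)]


def bulk_body(purchases, index="purchases", doc_type="purchase"):
    """Build a newline-delimited _bulk request body, one document per purchase."""
    lines = []
    for user_id, book_id in purchases:
        # Action line, followed by the document source on the next line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps({"user_id": user_id, "book_id": book_id}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


print(bulk_body(PURCHASES))
```

The resulting string could be sent to the `_bulk` endpoint; each pair of lines is one action plus one purchase document.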

Eventually, I'd like to do all of this in a single query, to reduce the amount of data transferred from the ES cluster.
I'm also not bound to this representation of the data. If a different one would make this significantly easier, I'd be happy to hear about it.

For now, I have tried this with multiple queries. First I fetch the books that user 5 has purchased (books 1 and 4), then the users who share purchases with user 5 (users 1 and 4, with 1 and 2 shared purchases, respectively). I would then like to weight each similar user's books by the size of that overlap, so that books bought by user 4 count with a factor of 2 and books bought by user 1 with a factor of 1. My current (not working) draft looks like this:

curl -sXPOST 'http://localhost:9200/purchases/purchase/_search' -d '{
  "query": {
    "terms": {
      "user_id": [1, 4]
    }
  },
  "size": 0,
  "aggs": {
    "recommendations": {
      "terms": {
        "field": "book_id",
        "exclude": [1, 4]
      },
      "aggs": {
        "score": {
          "sum": {
            "script": {
              "inline": "doc[\"user_id\"].value == 4 ? 2 : (doc[\"user_id\"].value == 1 ? 1 : 0)"
            }
          }
        }
      }
    }
  }
}'
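To make the scoring I am after concrete, here is a minimal sketch in plain Python, without Elasticsearch. The purchase data is made up so that it matches the numbers above (user 5 has books 1 and 4; users 1 and 4 share 1 and 2 of them). `PURCHASES` and `recommend` are just names I picked for illustration.

```python
from collections import Counter, defaultdict

# Made-up (user_id, book_id) pairs, chosen only to match the numbers in this post.
PURCHASES = [
    (1, 1), (1, 2),
    (2, 2), (2, 3),
    (4, 1), (4, 2), (4, 3), (4, 4),
    (5, 1), (5, 4),
]


def recommend(purchases, target_user):
    """Score unseen books by the summed purchase overlap of the users who bought them."""
    books_by_user = defaultdict(set)
    for user_id, book_id in purchases:
        books_by_user[user_id].add(book_id)

    owned = books_by_user[target_user]
    scores = Counter()
    for user_id, books in books_by_user.items():
        if user_id == target_user:
            continue
        weight = len(books & owned)  # number of shared purchases
        if weight == 0:
            continue  # no overlap, so not a similar user
        for book_id in books - owned:
            scores[book_id] += weight
    # Highest score first; ties broken by book_id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(recommend(PURCHASES, target_user=5))
```

With this made-up data, book 2 gets a score of 3 (1 from user 1 plus 2 from user 4) and book 3 a score of 2. That is the kind of ranking I would like the aggregation to produce.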

What am I doing wrong?
Am I on the right track?
Is this easier to solve using Elasticsearch Graph (X-Pack), and if so, how?
How could I eventually do all of this in a single query that takes only the user I want recommendations for?

Any help is greatly appreciated! Also, my company's ES cluster is running version 2.3.0. We plan to upgrade to a recent 5.x release, but that is not feasible in the short term.

Best, Jonas
