Hello everyone, I'd like some help with a recommendation system that I'm trying to put together.
I have data for millions of users who buy from thousands of retail shops (food, clothes, services). What I need is, given a shop X, to recommend N users who might be interested in buying from shop X.
The data I currently have in Elasticsearch looks like this:
PUT recommendations/shop/1
{ "frequent_users": ["user1", "user2", "user3", "user4", "user5", "user6"] }
PUT recommendations/shop/2
{ "frequent_users": ["user1", "user2", "user3"] }
PUT recommendations/shop/3
{ "frequent_users": ["user4", "user5", "user6"] }
PUT recommendations/shop/4
{ "frequent_users": ["user1", "user6"] }
Keep in mind that I have access to every transaction made by each user in all the shops. I only grouped the data like this in order to index it in ES, and I can change how the information is indexed if needed.
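For reference, here is a minimal Python sketch of how this grouping could be produced from the raw transactions. It assumes each transaction is reduced to a (user, shop) pair; the function name group_frequent_users and the min_purchases threshold for what counts as "frequent" are hypothetical, not something fixed in my pipeline.

```python
from collections import Counter, defaultdict


def group_frequent_users(transactions, min_purchases=1):
    """Group (user, shop) pairs into one document body per shop.

    min_purchases is an assumed cutoff: a user is listed for a shop
    only after buying there at least that many times.
    """
    purchase_counts = Counter(transactions)  # (user, shop) -> number of purchases
    users_by_shop = defaultdict(list)
    for (user, shop), count in sorted(purchase_counts.items()):
        if count >= min_purchases:
            users_by_shop[shop].append(user)
    # Same shape as the bodies sent to PUT recommendations/shop/<id>
    return {shop: {"frequent_users": users} for shop, users in users_by_shop.items()}


transactions = [("user1", "2"), ("user2", "2"), ("user1", "4"), ("user6", "4")]
docs = group_frequent_users(transactions)
```

Swapping the roles of user and shop in the same function would give one document per user listing the shops they buy from, in case that layout turns out to be easier to query.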
The part where I'm lost is querying this information with the significant_terms aggregation. As mentioned above, the query I need is: given a shop X, give me a list of users who might want to shop there. This is the query I have so far:
POST recommendations/shop/_search
{
  "query": {
    "match": {
      "frequent_users": "user1"
    }
  },
  "aggregations": {
    "clients": {
      "significant_terms": {
        "field": "frequent_users.keyword",
        "min_doc_count": 1
      }
    }
  }
}
Given a single user (user1), this query retrieves similar users based on the shops where they also bought. This is the response to the query above:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "recommendations",
        "_type": "shop",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "frequent_users": [
            "user1",
            "user2",
            "user3",
            "user4",
            "user5",
            "user6"
          ]
        }
      },
      {
        "_index": "recommendations",
        "_type": "shop",
        "_id": "4",
        "_score": 0.19856805,
        "_source": {
          "frequent_users": [
            "user1",
            "user6"
          ]
        }
      },
      {
        "_index": "recommendations",
        "_type": "shop",
        "_id": "2",
        "_score": 0.16853254,
        "_source": {
          "frequent_users": [
            "user1",
            "user2",
            "user3"
          ]
        }
      }
    ]
  },
  "aggregations": {
    "clients": {
      "doc_count": 3,
      "bg_count": 4,
      "buckets": [
        {
          "key": "user1",
          "doc_count": 3,
          "score": 0.3333333333333333,
          "bg_count": 3
        },
        {
          "key": "user2",
          "doc_count": 2,
          "score": 0.22222222222222215,
          "bg_count": 2
        },
        {
          "key": "user6",
          "doc_count": 2,
          "score": 0.22222222222222215,
          "bg_count": 2
        },
        {
          "key": "user3",
          "doc_count": 2,
          "score": 0.22222222222222215,
          "bg_count": 2
        },
        {
          "key": "user4",
          "doc_count": 1,
          "score": 0.11111111111111108,
          "bg_count": 1
        },
        {
          "key": "user5",
          "doc_count": 1,
          "score": 0.11111111111111108,
          "bg_count": 1
        }
      ]
    }
  }
}
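As far as I can tell, the bucket scores above match the default JLH heuristic of significant_terms, which multiplies the absolute change in a term's frequency by its relative change: (foreground % - background %) * (foreground % / background %). Here is a small Python sketch that reproduces them from the counts in the response; the function name jlh_score is only for illustration and is not part of any Elasticsearch client.

```python
def jlh_score(doc_count, fg_total, bg_count, bg_total):
    """JLH score from one bucket's counts and the aggregation totals.

    doc_count / fg_total: share of matching documents containing the term.
    bg_count / bg_total:  share of background documents containing the term.
    """
    fg_share = doc_count / fg_total
    bg_share = bg_count / bg_total
    absolute_change = fg_share - bg_share
    if absolute_change <= 0:
        # Terms that are not more common in the foreground score 0
        return 0.0
    relative_change = fg_share / bg_share
    return absolute_change * relative_change


# Totals from the response: "doc_count": 3 and "bg_count": 4 under "clients"
user1_score = jlh_score(3, 3, 3, 4)  # about 0.3333, as in the user1 bucket
user2_score = jlh_score(2, 3, 2, 4)  # about 0.2222, as in the user2 bucket
user4_score = jlh_score(1, 3, 1, 4)  # about 0.1111, as in the user4 bucket
```

So with only four shops, the ranking mostly reflects how many of user1's shops each user also appears in, and user1 itself comes out on top.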
Thanks in advance! Any help would be greatly appreciated.