Recommendation System for retail


(Daniel Zuluaga) #1

Hello everyone, I want to ask for help with a recommendation system that I'm trying to put together.

I have data for millions of users who buy from thousands of retail shops (food, clothes, services). What I need is, given a shop X, to recommend N users who might be interested in buying from shop X.

The data I currently have in Elasticsearch looks like this:

PUT recommendations/shop/1
{ "frequent_users": ["user1", "user2", "user3", "user4", "user5", "user6"] }
PUT recommendations/shop/2
{ "frequent_users": ["user1", "user2", "user3"] }
PUT recommendations/shop/3
{ "frequent_users": ["user4", "user5", "user6"] }
PUT recommendations/shop/4
{ "frequent_users": ["user1", "user6"] }

Keep in mind that I have access to every transaction made by each user in all the shops. I only grouped the data like this in order to index it in ES, and I can change how the information is indexed if needed.

The part where I'm lost is querying the information with the significant_terms aggregation. As mentioned above, the query I need is: given a shop X, give me a list of users who might want to shop there. This is the query I have so far:

POST recommendations/shop/_search
{
    "query": {
        "match": {
            "frequent_users": "user1"
        }
    },
    "aggregations": {
        "clients": {
            "significant_terms": {
                "field": "frequent_users.keyword",
                "min_doc_count": 1
            }
        }
    }
}

Given a single user (user1), this query retrieves similar users based on where other users bought. This is the response to the query above:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "recommendations",
        "_type": "shop",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "frequent_users": [
            "user1",
            "user2",
            "user3",
            "user4",
            "user5",
            "user6"
          ]
        }
      },
      {
        "_index": "recommendations",
        "_type": "shop",
        "_id": "4",
        "_score": 0.19856805,
        "_source": {
          "frequent_users": [
            "user1",
            "user6"
          ]
        }
      },
      {
        "_index": "recommendations",
        "_type": "shop",
        "_id": "2",
        "_score": 0.16853254,
        "_source": {
          "frequent_users": [
            "user1",
            "user2",
            "user3"
          ]
        }
      }
    ]
  },
  "aggregations": {
    "clients": {
      "doc_count": 3,
      "bg_count": 4,
      "buckets": [
        {
          "key": "user1",
          "doc_count": 3,
          "score": 0.3333333333333333,
          "bg_count": 3
        },
        {
          "key": "user2",
          "doc_count": 2,
          "score": 0.22222222222222215,
          "bg_count": 2
        },
        {
          "key": "user6",
          "doc_count": 2,
          "score": 0.22222222222222215,
          "bg_count": 2
        },
        {
          "key": "user3",
          "doc_count": 2,
          "score": 0.22222222222222215,
          "bg_count": 2
        },
        {
          "key": "user4",
          "doc_count": 1,
          "score": 0.11111111111111108,
          "bg_count": 1
        },
        {
          "key": "user5",
          "doc_count": 1,
          "score": 0.11111111111111108,
          "bg_count": 1
        }
      ]
    }
  }
}
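
For readers wondering where the bucket scores come from: they are consistent with the JLH heuristic, which is the default scoring for significant_terms. The sketch below assumes that default is in use and recomputes three of the scores from the doc_count and bg_count values in the response. The function name jlh_score is only illustrative.

```python
def jlh_score(doc_count, fg_total, bg_count, bg_total):
    """JLH = (fg% - bg%) * (fg% / bg%).

    fg% is the share of matching (foreground) docs that contain the term,
    bg% is the share of all indexed (background) docs that contain it.
    """
    fg = doc_count / fg_total
    bg = bg_count / bg_total
    if bg == 0 or fg <= bg:
        return 0.0  # terms that are not more common in the foreground score zero
    return (fg - bg) * (fg / bg)


# Values taken from the response above: 3 matching docs, 4 docs in the index.
for key, doc_count, bg_count in [("user1", 3, 3), ("user2", 2, 2), ("user4", 1, 1)]:
    print(key, round(jlh_score(doc_count, 3, bg_count, 4), 4))
# user1 0.3333
# user2 0.2222
# user4 0.1111
```

With only four documents, the foreground set is most of the index, so every user who appears in it gets a positive score. The scores become more discriminating on a realistic data volume.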

Thanks in advance! Any help would be greatly appreciated.


(Mark Harwood) #2

Reading it back, the task you have is to "find more people like the people who visit Store X".
The challenge is identifying a certain type of person.
Judging by the data you have presented, you know nothing about these people in terms of age, gender, location, friendships etc. The only thing you can use to describe "that type" of person is the list of other stores they visit, in the hope that it somehow "defines" them. For that you would need a person-centric customer index, e.g.

{ "user": 1, "visited_stores": ["x", "y", "z"] }
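
Since the raw transactions are available, documents of this shape could be produced with a simple group-by before indexing. This is a minimal sketch, assuming each transaction can be reduced to a (user, store) pair; build_user_docs and the sample data are hypothetical.

```python
from collections import defaultdict


def build_user_docs(transactions):
    """Group (user_id, store_id) pairs into one document per user."""
    stores_by_user = defaultdict(set)
    for user_id, store_id in transactions:
        stores_by_user[user_id].add(store_id)  # a set drops repeat visits
    return [
        {"user": user_id, "visited_stores": sorted(stores)}
        for user_id, stores in sorted(stores_by_user.items())
    ]


# Hypothetical sample transactions: (user, store)
transactions = [(1, "x"), (1, "y"), (2, "x"), (1, "x"), (2, "z")]
docs = build_user_docs(transactions)
print(docs)
# [{'user': 1, 'visited_stores': ['x', 'y']}, {'user': 2, 'visited_stores': ['x', 'z']}]
```

Repeat visits are collapsed here. If visit frequency matters, a threshold on the number of transactions per store could be applied before adding a store to the list.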

You'd then query for the existing store X customers and look for significant terms on the "visited_stores" field to see if there was anything "uncommonly common" about them. You would then use these suggested stores as a query - minus those people who are already Store X visitors eg

{
    "bool": {
        "should": [
            { "term": { "visited_stores": "store_significantly_like_x" } },
            { "term": { "visited_stores": "another_store_significantly_like_x" } },
            ...
        ],
        "must_not": [
            { "term": { "visited_stores": "x" } }
        ]
    }
}

Note that you should use the verbose form of multiple term queries in a should clause rather than a single terms query. Elasticsearch assumes you don't want relevance scoring on terms queries, and here we absolutely do want IDF scoring, so that rare things like Joe's Skate Shack count for more than common terms like Walmart.
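
When the list of similar stores comes from a previous aggregation, the should/must_not query above can be generated rather than written by hand. This is a minimal sketch; build_lookalike_query is a hypothetical helper and the store names are placeholders.

```python
def build_lookalike_query(similar_stores, target_store, field="visited_stores"):
    """Build a bool query with one scored term clause per similar store."""
    return {
        "query": {
            "bool": {
                # One term query per store keeps per-term relevance scoring,
                # unlike a single terms query.
                "should": [{"term": {field: store}} for store in similar_stores],
                # Exclude people who already visit the target store.
                "must_not": [{"term": {field: target_store}}],
            }
        }
    }


query = build_lookalike_query(
    ["store_significantly_like_x", "another_store_significantly_like_x"], "x"
)
```

The resulting dictionary can be serialized with json.dumps and sent as the body of a _search request.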


(Daniel Zuluaga) #3

Hello Mark, thank you so much for your help!

I do have more information about each customer. This is a sample of the table, which holds millions of transactions:

consumer_id trx_date trx_time value commerce_id tj name gender age stratum occupation email income lat lon city mcc
1 20180528 11:44:50 181400 10803914 1 name F 63 NULL @yahoo.com 7 11,00892419 -74,83473888 BARRANQUILLA 763
2 20180516 19:17:38 131,58 8060000000 1 name M 44 Empleado @hotmail.com 12 4,650019173 -74,12242269 BOGOTA NULL
3 20180516 15:23:46 1181040 612250000 1 name M 39 Empleado @gmail.com 8 4,744751454 -74,086149 BOGOTA NULL
4 20180524 14:06:57 116150 12321071 1 name F 39 Independiente @hotmail.com 4 11,00649976 -74,83454968 BARRANQUILLA 9399

The important fields are gender, age, occupation, income, lat, lon and city. The shop id is the commerce_id field. I could also add the shop name to this data, but I'm using commerce_id as the unique shop identifier.


(Mark Harwood) #4

Significant terms is designed to find individual terms that are correlated with your query.
However, the type of people who visit store X might be best identified by multiple terms in combination e.g. the store is found to be popular with women aged between 20 and 30. That particular combo of information is not currently a single term in the index so can't be discovered using the significant_terms aggregation (unless you index using special single-token strings eg Male+Teenager+London). You may find doing some analysis in R or similar would be a better way to discover the combinations of attributes that define Store X customers and then use elasticsearch to query for customers with those attributes.
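
One possible way to produce such single-token strings at index time is sketched below. It goes a step beyond the suggestion above by emitting a token for every combination of two or more attributes, so that significant_terms could surface whichever combination stands out. The function name and sample values are hypothetical.

```python
from itertools import combinations


def combo_tokens(user, fields, min_size=2):
    """Return one '+'-joined token per combination of the given fields."""
    tokens = []
    for size in range(min_size, len(fields) + 1):
        for subset in combinations(fields, size):
            tokens.append("+".join(str(user[f]) for f in subset))
    return tokens


user = {"gender": "Male", "age_group": "Teenager", "city": "London"}
print(combo_tokens(user, ["gender", "age_group", "city"]))
# ['Male+Teenager', 'Male+London', 'Teenager+London', 'Male+Teenager+London']
```

The number of tokens grows exponentially with the number of fields, so this is only practical for a handful of attributes.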


(Daniel Zuluaga) #5

What if I index using your suggestion of a person-centric customer index, and add the demographic information like so:

{ "user": 1, "gender": "Male", "city": "aaa", "age_group": "Middleage", "occupation": "Zzz", "visited_stores": [1, 2, 3, 4, 5] }
{ "user": 2, "gender": "Female", "city": "aaa", "age_group": "Teenager", "occupation": "Xxx", "visited_stores": [1, 2, 3, 4, 5] }
{ "user": 3, "gender": "Male", "city": "bbb", "age_group": "Middleage", "occupation": "Zzz", "visited_stores": [1, 2, 3, 4, 5] }
{ "user": 4, "gender": "Female", "city": "ccc", "age_group": "Teenager", "occupation": "Www", "visited_stores": [1, 2, 3, 4, 5] }

Would something like this work for asking the question as you stated?

"Find more people like the people who visit Store X"


(Mark Harwood) #6

"Would something like this work"

Yes, it would help, but as I mentioned in my previous comment, I suspect it would also be useful to try indexing combos of demographic info if you want to discover, for example, that shop X is predominantly for middle-aged men. You'd need a field that concatenates age_group and gender into a single term.


(Daniel Zuluaga) #7

So it would be something similar with a new field, like this:

{ "user": 1, "gender": "Male", "city": "aaa", "age_group": "Middleage", "occupation": "Zzz", "visited_stores": [1, 2, 3, 4, 5] , "combo": "Male+aaa+Middleage+Zzz" }
{ "user": 2, "gender": "Female", "city": "aaa", "age_group": "Teenager", "occupation": "Xxx", "visited_stores": [1, 2, 3, 4, 5] , "combo": "Female+aaa+Teenager+Xxx" }
{ "user": 3, "gender": "Male", "city": "bbb", "age_group": "Middleage", "occupation": "Zzz", "visited_stores": [1, 2, 3, 4, 5] , "combo": "Male+bbb+Middleage+Zzz" }
{ "user": 4, "gender": "Female", "city": "ccc", "age_group": "Teenager", "occupation": "Www", "visited_stores": [1, 2, 3, 4, 5] , "combo": "Female+ccc+Teenager+Www" }

Could you then help me with what the query would look like once the information is indexed like this?


(Mark Harwood) #8

You'd need to map combo as a keyword field type then run a query something like this:

POST users/user/_search
{
    "query": {
        "match": {
            "visited_stores": "1"
        }
    },
    "size": 0,
    "aggregations": {
        "clients": {
            "significant_terms": {
                "field": "combo"
            }
        }
    }
}

That would give you the stereotypes for store 1 visitors.
Then you do the query to find the people who haven't visited store 1 but fit the store 1 stereotype:

POST users/user/_search
{
    "query": {
        "bool": {
            "should": [
                { "term": { "combo": "Male+aaa+Middleage+Zzz" } },
                ... other significant stereotypes
            ],
            "must_not": [
                { "term": { "visited_stores": "1" } }
            ]
        }
    }
}
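
The two steps above can be chained in client code: read the bucket keys from the first response and turn them into the should clauses of the second query. This is a minimal sketch under stated assumptions. The stereotype_query helper, the min_score cutoff and the sample response (abbreviated, with made-up scores) are all hypothetical.

```python
def stereotype_query(agg_response, store_id, agg_name="clients", min_score=0.0):
    """Turn significant_terms buckets into the follow-up bool query."""
    buckets = agg_response["aggregations"][agg_name]["buckets"]
    # Keep only combos whose significance score exceeds the cutoff.
    combos = [b["key"] for b in buckets if b["score"] > min_score]
    return {
        "query": {
            "bool": {
                "should": [{"term": {"combo": c}} for c in combos],
                "must_not": [{"term": {"visited_stores": store_id}}],
            }
        }
    }


# Hypothetical, abbreviated response from the first query (scores are made up).
response = {
    "aggregations": {
        "clients": {
            "buckets": [
                {"key": "Male+aaa+Middleage+Zzz", "score": 0.4},
                {"key": "Male+bbb+Middleage+Zzz", "score": 0.1},
            ]
        }
    }
}
query = stereotype_query(response, "1", min_score=0.2)
```

If no bucket passes the cutoff, the should list is empty and the query would match every user who has not visited the store, so it is worth checking for that case before sending it.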

(Daniel Zuluaga) #9

Hello Mark, I just wanted to thank you for all your help!

I have a solution working at the moment and am planning to deploy it to production soon.


(Mark Harwood) #10

Good to hear! Hope all goes well.

It might be worth considering using the sampler aggregation (or the diversified_sampler) in conjunction with significant terms. It can improve both the search response times and the quality of results - see https://www.youtube.com/watch?v=azP15yvbOBA
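
As a rough illustration, the significant_terms aggregation can be nested under a sampler aggregation so that it only analyses the top-scoring documents on each shard. This is a sketch that assumes the users index and the combo field from earlier in the thread. The shard_size of 200 is an arbitrary value to tune, and the helper name is hypothetical.

```python
def sampled_significant_terms(store_id, shard_size=200):
    """Request body: significant_terms on 'combo', limited to a per-shard sample."""
    return {
        "query": {"match": {"visited_stores": store_id}},
        "size": 0,
        "aggregations": {
            "sample": {
                # Only the top-scoring shard_size docs per shard are analysed.
                "sampler": {"shard_size": shard_size},
                "aggregations": {
                    "clients": {"significant_terms": {"field": "combo"}}
                },
            }
        },
    }


body = sampled_significant_terms("1")
```

In the response, the buckets would then appear one level deeper, under aggregations.sample.clients.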

Another tip for making sure you have an effective system is to benchmark on existing data to determine how good your store visitor profiles are. To do this, take a percentage of the users who have visited store X, remove store X from their list of visited stores and add it to a new field called "heldBackStores". You can then benchmark your recommendation system by looking at how many recommendations are offered for users who match what you have derived as the "Store X profile of customer", and how many of the genuine store X users (those with an entry in heldBackStores) you manage to match.
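
A minimal sketch of this hold-out benchmark in client code, assuming user documents shaped like the ones earlier in the thread. The function names, the fixed random seed and the choice of recall as the metric are illustrative assumptions.

```python
import random


def hold_back(users, store, fraction, seed=0):
    """Move `store` into heldBackStores for a random fraction of its visitors."""
    rng = random.Random(seed)
    visitors = [u["user"] for u in users if store in u["visited_stores"]]
    held_ids = set(rng.sample(visitors, int(len(visitors) * fraction)))
    result = []
    for u in users:
        doc = dict(u, visited_stores=list(u["visited_stores"]), heldBackStores=[])
        if u["user"] in held_ids:
            doc["visited_stores"] = [s for s in doc["visited_stores"] if s != store]
            doc["heldBackStores"] = [store]
        result.append(doc)
    return result


def recall(recommended_ids, users, store):
    """Share of held-back visitors of `store` that were recommended."""
    held = {u["user"] for u in users if store in u.get("heldBackStores", [])}
    if not held:
        return 0.0
    return len(held & set(recommended_ids)) / len(held)


users = [{"user": i, "visited_stores": [1, 2]} for i in range(1, 5)]
split = hold_back(users, store=1, fraction=0.5)
held = [u["user"] for u in split if u["heldBackStores"]]
print(len(held), recall(held, split, store=1))
```

The modified documents would be indexed in place of the originals, the profile derived and queried as before, and the returned user ids passed to recall. Comparing that figure with the total number of recommendations gives a sense of how precise the profile is.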

