Graph explore api for webshop case

graph

(Malthe) #1

Hi Elasticsearch community.

I am trying to build a feature with the Explore API that shows what other people bought after looking at the product you are currently viewing.

I have indexed documents in ES for what people have looked at and what they have bought, but I can't get relevant results from the API.

What I have currently built looks something like this:

  1. Find the product you are looking at in the database.
  2. Call the Explore API to get the most relevant IP addresses of other people who looked at this product, and which products they have looked at.
  3. Call the Explore API again to find the most relevant sales made from those IP addresses.
  4. Sort the products that have been sold and find whether they are in the same category as the product you are currently looking at.
  5. Show a message to the user: "Others who looked at this product, ended up buying this product".

Hope that you guys can follow the idea for this feature.

Please let me know if you need anything explained better. :slight_smile:

And also: have any of you tried to build a feature like this with the API?

Regards Malthe.


(Mark Harwood) #2

Hi Malthe,

An example of this sort of thing on click data is in this video (see from 27:30).
Here though we are using query terms and product codes as vertices in the graph rather than user IDs.

The significance algorithms will certainly help to avoid just recommending the universally top-selling products regardless of the starting question but you have to be careful about what vertices you choose to use them on.
Query terms are common to more than one user (so we have many potential data points) and also contain a reasonably clear notion of user intent. They can be useful vertices in a graph.
However, if something low-frequency like an IP address is used as a vertex in the Graph and you have use_significance turned on, then that can act as a dead-end in the crawl because there aren't enough data points around each single IP address to identify a significant relationship to anything else.
It might be better in these cases to just use a bag of IP addresses in a regular search and use the significant_terms agg on the product codes field. This way you are looking for products significantly connected to this mass rather than independently looking for strong associations for each member of that group (and typically failing to find enough repeat-business to be certain about a user->product connection).
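
A minimal sketch of the "bag of IP addresses" idea, assuming .NET 6+ (for System.Text.Json.Nodes) and the field names `ipaddress.keyword` and `productid` that appear in the queries posted later in this thread. The class and method names are hypothetical, and the block only builds the `_search` request body; it does not call any Elasticsearch client.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.Json.Nodes;

public static class BagOfIpsSearch
{
    // Builds a _search body that matches documents from ANY of the given IPs
    // and asks which product codes are unusually common in that whole set.
    public static JsonObject Build(IEnumerable<string> ipAddresses, int maxProducts = 10)
    {
        // Each string converts implicitly to a JSON string node.
        var ips = new JsonArray(ipAddresses.Select(ip => (JsonNode)ip).ToArray());

        return new JsonObject
        {
            ["size"] = 0, // only the aggregation is needed, not the hits
            ["query"] = new JsonObject
            {
                // Field name taken from the thread; adjust to your mapping.
                ["terms"] = new JsonObject { ["ipaddress.keyword"] = ips }
            },
            ["aggs"] = new JsonObject
            {
                ["significant_products"] = new JsonObject
                {
                    ["significant_terms"] = new JsonObject
                    {
                        ["field"] = "productid",
                        ["size"] = maxProducts
                    }
                }
            }
        };
    }
}
```

`BagOfIpsSearch.Build(ips).ToJsonString()` gives a string that can be sent as the body of a regular search request. The point of the design is that significance is measured once for the whole group of IPs, not separately for each low-frequency IP.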

If you move away from a data model that is simply user+product+category+query click data to one with a single doc per user, each containing a list of historic clicks/purchases, you can look for product<->product associations more cheaply and quickly. Entity-centric indexing is a useful technique for building these user profiles.


(Malthe) #3

Hi Mark.
Thank you for the answer.

I have watched your video and read more about significant terms, but I can't seem to get it to work.
Could you take a look at my queries and suggest how to improve them? I have posted them further down. They are the old queries, because I couldn't get my newer attempts to run without crashing.

For the entity-centric part:
So what you are saying is that I should index the data so that each user is one doc containing the product the user has looked at?

Queries
This is the first request I make to the API, to get what people have looked at. The Term variable is the productid.
var response = lowlevelClient.XpackGraphExplore("{INDEX HERE}", @"
{
  ""query"": {
    ""query_string"": {
      ""query"": """ + Term + @"""
    }
  },
  ""controls"": {
    ""use_significance"": true,
    ""sample_size"": 1000,
    ""timeout"": 50000
  },
  ""connections"": {
    ""query"": {
      ""bool"": {
        ""filter"": [
          {
            ""range"": {
              ""timestamp"": {
                ""gte"": ""2016-02-25"",
                ""lt"": ""2017-03-25""
              }
            }
          }
        ]
      }
    },
    ""vertices"": [
      {
        ""field"": ""ipaddress.keyword"",
        ""size"": 35,
        ""min_doc_count"": 1
      },
      {
        ""field"": ""productid"",
        ""size"": 35,
        ""min_doc_count"": 1
      }
    ]
  },
  ""vertices"": [
    {
      ""field"": ""ipaddress.keyword"",
      ""size"": 35,
      ""min_doc_count"": 1
    },
    {
      ""field"": ""productid"",
      ""size"": 35,
      ""min_doc_count"": 1
    }
  ]
}
");

Then I loop through the IP addresses in the response from that request to get the sales for each IP.
var ipsales = lowlevelClient.XpackGraphExplore("INDEX HERE", @"
{
  ""query"": {
    ""query_string"": {
      ""query"": """ + result.Ipaddress + @"""
    }
  },
  ""controls"": {
    ""use_significance"": false,
    ""sample_size"": 1000,
    ""timeout"": 50000
  },
  ""connections"": {
    ""query"": {
      ""bool"": {
        ""filter"": [
          {
            ""range"": {
              ""orderdate"": {
                ""gte"": ""2016-02-25"",
                ""lt"": ""2017-03-25""
              }
            }
          }
        ]
      }
    },
    ""vertices"": [
      {
        ""field"": ""salesorderid"",
        ""size"": 35,
        ""min_doc_count"": 1
      },
      {
        ""field"": ""ipaddress.keyword"",
        ""size"": 1,
        ""min_doc_count"": 2
      },
      {
        ""field"": ""productid"",
        ""size"": 35,
        ""min_doc_count"": 1
      }
    ]
  },
  ""vertices"": [
    {
      ""field"": ""salesorderid"",
      ""size"": 35,
      ""min_doc_count"": 5
    },
    {
      ""field"": ""ipaddress.keyword"",
      ""size"": 1,
      ""min_doc_count"": 5
    },
    {
      ""field"": ""productid"",
      ""size"": 35,
      ""min_doc_count"": 5
    }
  ]
}");

Hope you can help. :slight_smile:

Regards Malthe


(Mark Harwood) #4

Yes, but products (plural): one doc per user, listing every product that user has looked at.

You can see an example of this on the LastFM data using this script [1].
Each user has a list of the artists they liked, and you can use these sorts of docs and Graph to explore the connections between artists without individual user IDs needing to appear as vertices in the crawled graph. The music data works well because people's affiliations to particular musical genres are quite strong, bordering on obsessive. Other datasets, e.g. product purchases, may have weaker signals due to fewer data points and weaker subject affiliations. There are no toaster cults (as far as I am aware).

In your data it may be useful to make the entities a session (a concrete session ID or perhaps a combo of IP address and a day) and for each session doc to have sets of products viewed/purchased. At least in a single session there may be a stronger sense of continuity between the items listed rather than an all-time perspective of an ip address with changing motives.
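
A rough sketch of how raw view/purchase events could be folded into one doc per session, where a session is taken to be "IP address + calendar day" as suggested above. It assumes C# 9+ and plain LINQ. The `ShopEvent` and `SessionDoc` types and all their property names are hypothetical, not something taken from an existing index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one raw event (a product view or a purchase).
public record ShopEvent(string IpAddress, DateTime Timestamp, string ProductId, bool IsPurchase);

// Hypothetical entity-centric doc: one per IP address per day.
public record SessionDoc(
    string SessionId,
    string IpAddress,
    IReadOnlyList<string> ProductsSeen,
    IReadOnlyList<string> ProductsBought);

public static class SessionBuilder
{
    public static List<SessionDoc> Build(IEnumerable<ShopEvent> events) =>
        events
            // Session key: same IP on the same calendar day.
            .GroupBy(e => (e.IpAddress, Day: e.Timestamp.Date))
            .Select(g => new SessionDoc(
                SessionId: $"{g.Key.IpAddress}_{g.Key.Day:yyyy-MM-dd}",
                IpAddress: g.Key.IpAddress,
                // Sets, not counts: each product appears once per session.
                ProductsSeen: g.Where(e => !e.IsPurchase).Select(e => e.ProductId).Distinct().ToList(),
                ProductsBought: g.Where(e => e.IsPurchase).Select(e => e.ProductId).Distinct().ToList()))
            .ToList();
}
```

Each `SessionDoc` would then be indexed as a single document, so that products seen or bought together share a document and can be related to each other directly, with no IP address vertex in between.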

[1] https://gist.github.com/markharwood/f67a8532f0acba8dcc3fba07541b0933


(Malthe) #5

Thank you Mark.

I have watched some more of your videos and together with what you told me now, I think I might have a solution.


(Malthe) #6

Hi again.

Over the past weeks (not every day) I have been working on a new index.
So this is what an entity looks like now:

  • Sessionid: IP + date
  • ProductsSeen: Array of products that an IP has seen
  • ProductsBought: Array of products that an IP has bought
  • IPAddress

I get some very nice graphs, but I can't get anything useful from the API, nor from a regular search with significant_terms on products.

Am I doing something wrong? Does the entity look correct?

Regards Malthe


(Mark Harwood) #7

Are you saying the calls to the API differ from what you gather using the UI? That should be a solvable problem because the UI uses the API.

Your model looks good. Let's look at a similar example using instacart data [1].

Nearly everyone buys milk, strawberries and bananas. These are the top products.

So if you search for people who bought "pasta" with significance turned off, you will recommend... milk, strawberries and bananas as an accompaniment.

However, if you have significant links turned on, you get Italian sauce and related products.

So it can work on this shape of data, but it does depend on having a reasonable amount of it and on there being some cohesion to the products (pasta and Italian sauce tend to go together like Black Sabbath and Motorhead). Also, I generally tend to use one shard if possible to keep all the signal in one place.
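
To compare the two behaviours on your own data, one option is to send the same Graph explore request twice and flip only `use_significance`. The sketch below builds such a body, assuming .NET 6+ and a session-style index where the product field is a keyword array. The class name, the `sample_size` of 2000 and the `min_doc_count` of 3 are illustrative choices, not recommendations.

```csharp
using System.Text.Json.Nodes;

public static class GraphExploreBody
{
    // Builds a Graph explore body: start from docs containing seedProduct in
    // `field`, then return the top terms of that same field as vertices.
    public static JsonObject Build(string field, string seedProduct, bool useSignificance, int size = 10)
    {
        return new JsonObject
        {
            ["query"] = new JsonObject
            {
                ["term"] = new JsonObject { [field] = seedProduct }
            },
            ["controls"] = new JsonObject
            {
                // true: rank by statistical significance; false: rank by popularity
                ["use_significance"] = useSignificance,
                ["sample_size"] = 2000
            },
            ["vertices"] = new JsonArray(
                new JsonObject
                {
                    ["field"] = field,
                    ["size"] = size,
                    ["min_doc_count"] = 3
                })
        };
    }
}
```

Calling `Build("ProductsBought", "pasta", false)` and `Build("ProductsBought", "pasta", true)` yields two bodies that differ only in the `use_significance` flag, so any difference in the returned vertices comes from the ranking method alone.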

[1] https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2


(Malthe) #8

Yes.
When I use the Console in Kibana I am not getting what I expect. For example:

Take the iPhone 6. I know a lot of people have looked at it and bought it on the day in question, because I checked the database manually.

Then I run a search with iPhone 6 as the query string against ProductsSeen. What I get back is not other iPhones, or phones in general, but a much broader set of products.
Some of this data may relate to the iPhone 6, but it is not only phones.

Should I run significant_terms on ProductsBought or on ProductsSeen?


(Mark Harwood) #9

Pasta and Italian sauce may frequently be bought together in a shopping session, but I expect not many "iphone 6" and "iphone 6s" get bought in the same session. These being higher-value items, it would probably make more sense to use ProductsSeen rather than ProductsBought. Using ProductsBought is only likely to tell you which cases go with iPhones, not alternatives to iPhones.
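
One way to look at both signals at once is a single search over the session docs with two significant_terms aggregations, one per field. This is only a sketch under assumptions: .NET 6+, the `ProductsSeen` and `ProductsBought` fields from the entity described earlier in the thread mapped as keywords, and hypothetical class and aggregation names.

```csharp
using System.Text.Json.Nodes;

public static class SeenAndBoughtSearch
{
    // Helper: one significant_terms aggregation on the given field.
    private static JsonObject SignificantTerms(string field, int size) =>
        new JsonObject
        {
            ["significant_terms"] = new JsonObject
            {
                ["field"] = field,
                ["size"] = size
            }
        };

    // Builds a _search body over session docs in which productId was seen.
    public static JsonObject Build(string productId, int size = 10)
    {
        return new JsonObject
        {
            ["size"] = 0, // aggregations only
            ["query"] = new JsonObject
            {
                ["term"] = new JsonObject { ["ProductsSeen"] = productId }
            },
            ["aggs"] = new JsonObject
            {
                // Likely alternatives: products viewed in the same sessions.
                ["also_seen"] = SignificantTerms("ProductsSeen", size),
                // Likely outcomes or accessories: products bought in those sessions.
                ["also_bought"] = SignificantTerms("ProductsBought", size)
            }
        };
    }
}
```

The seed product itself will probably come back at the top of `also_seen`, since every matching session contains it, so it is worth skipping on the client side.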


(Malthe) #10

Okay, so running the aggregation on ProductsSeen will give me products that I might be interested in if I am looking at a certain product. Right?

Next up is finding out whether those products have been sold to whoever looked at them.


(Mark Harwood) #11

Maybe :slight_smile: . We are doing statistical analysis of co-occurrence of terms in a document. If there are enough examples of a pairing (e.g. Italian sauce and pasta) and the pairing is strong (so not milk and pasta), then we can present useful suggestions. Without playing directly with your data I can't say if it holds enough examples to be useful. You just need to try it and see.
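
To make "strong pairing" a little more concrete, here is a small sketch of the JLH score, which to my understanding is the default heuristic behind significant_terms: the absolute change in a term's frequency multiplied by its relative change. Whether Graph uses exactly this formula internally is an assumption here, and the counts in the usage note are invented purely for illustration.

```csharp
public static class Jlh
{
    // fgCount / fgTotal: docs containing the term within the matching set
    //                    (e.g. sessions that bought pasta).
    // bgCount / bgTotal: docs containing the term within the whole index.
    public static double Score(long fgCount, long fgTotal, long bgCount, long bgTotal)
    {
        if (fgTotal == 0 || bgTotal == 0 || bgCount == 0) return 0.0;

        double fg = (double)fgCount / fgTotal; // foreground frequency
        double bg = (double)bgCount / bgTotal; // background frequency

        double absoluteChange = fg - bg;
        if (absoluteChange <= 0) return 0.0;   // not more common than usual

        double relativeChange = fg / bg;
        return absoluteChange * relativeChange;
    }
}
```

With made-up numbers, say 1,000 pasta sessions out of 100,000 in total: milk appearing in 620 of the pasta sessions and 60,000 overall barely moves (62% against 60%), while Italian sauce appearing in 400 of the pasta sessions and only 2,000 overall jumps from 2% to 40%. `Jlh.Score(400, 1000, 2000, 100000)` therefore comes out far higher than `Jlh.Score(620, 1000, 60000, 100000)`, even though milk has more raw co-occurrences.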


(Malthe) #12

Thank you for your help!

I managed to create a prototype of the feature and it is working, wuhu! :slight_smile:

Regards


(Mark Harwood) #13

Glad to hear it's working out on your data!

