Elasticsearch similar elements count

I have the following product order list and need to find out what products are ordered together:

  [
      {
        "order_id": 1,
        "products": ["Coke", "Fanta"],
        "created_on": "2016-12-26T03:41:46+00:00"
    },
      {
        "order_id": 2,
        "products": ["Coke", "Pepsi", "Sprite"],
        "created_on": "2016-12-26T03:42:46+00:00"
      },
      {
        "order_id": 3,
        "products": ["Coke", "Sprite"],
        "created_on": "2016-12-26T03:45:46+00:00"
      }
    ]

As per the above example, the correct aggregation output should be "Coke" and "Sprite" since they were ordered 3 times.

Expected output:

{
  "products_duo_of_the_day": 1,
  "buckets": [
    {
      "products": [
        "Coke",
        "Sprite"
      ],
      "doc_count": 2
    }
  ]
}

Can someone please help me with building the Elastisearch query for getting this example output?

Check out the significant terms aggregation. This video describes using it for a similar scenario. Substitute band names for your products and you’re there.

2 Likes

A few things jump out

  1. The result doesn't look like what I'd expect - where's the score value?
  2. Significant_terms not the significant_text agg should be used on keyword fields.
  3. I previously assumed the question was "what goes with coke?" but it looks like the question you're trying to ask is potentially something to do with "what's special about pairings on given days?"

Can you say more about the business problem you're trying to solve?

1 Like

Is this about ranking straight popularity or particular significance on each day?
The top answer to the former is likely to be "bread and milk" for every single day.
The top answer to the latter could be unusual changes e.g pancake ingredients last Tuesday.

One way of optimising the index for this analysis is to also index SKU pairs. So given ["milk", "bread"] you'd also index them as an alphabetically sorted pair i.e. the pair token "bread_milk". Problem is that is an n-squared indexing strategy so doesn't work well when there are many unique products and many items purchased per basket.

1 Like

Ok - that’s just a ‘terms’ aggregation on a SKU-pair field like I described in my last post. The problem is when you have baskets with more than bread and milk in them. Your application code would have to generate all the pairs for a basket with bread, cheese, milk, eggs, coke etc. That’s a lot of SKU pairs and probably too expensive to generate if baskets are big.
Another approach is to have a two terms aggs on the “SKU” field - one nested under the other. This will give you the top 10 products on a day, and for each of those, the top 10 companion products also bought with those. It’s a slightly different analysis though eg in a worst case scenario a number one product might be selected that is only ever bought on its own (a ticket to park the car on arrival?) and has no pairings. So it’s really the most popular products and what they’re paired with rather than the most popular product pairs. Probably good enough though.

1 Like

Won’t the 2 terms aggs on the SKU field work for you?

So you’d have:

Date agg
    Terms agg on SKU 
         Terms agg on SKU

With complex script that is very likely to blow up memory. What is about the 2 terms aggs approach that doesn’t work for you?