Full outer join type queries - comparing two data sets


(James Crowley) #1

hey folks,

I'm trying to figure the best way to structure/query our data in ES, and would really appreciate some pearls of wisdom! We have documents at the moment that look like this:

Inventory { InventorySnapshotId: "2010-02-01", InventoryId: "ProductA", Quantity: 100, ManufacturerName: "John" }
Inventory { InventorySnapshotId: "2010-02-02", InventoryId: "ProductA", Quantity: 90, ManufacturerName: "Johnny" }
Inventory { InventorySnapshotId: "2010-02-01", InventoryId: "ProductB", Quantity: 50, ManufacturerName: "Sarah" }
Inventory { InventorySnapshotId: "2010-02-02", InventoryId: "ProductC", Quantity: 40, ManufacturerName: "Sally" }

Whenever we query this data, it's always relative to another baseline InventorySnapshotId. For instance, querying "2010-02-02" against a baseline of "2010-02-01"

 InventoryId | Quantity | BaselineQuantity | ManufacturerName | BaselineManufacturerName
 ProductA    | 90       | 100                | Johnny           | John
 ProductB    | 0        | 50                 |                  | Sarah   <- sold out inventory
 ProductC    | 40       |                    | Sally            |         <- new inventory

What's the best way to express/store this? In SQL semantics we'd use a full outer join between the current and the baseline rows... I'm thinking we might have to group in elastic search and then do some magic in the consumer, but.... any thoughts?

In case it's helpful,

Data summary:

  • There could be a few million inventory items per InventorySnapshotId, and several thousand InventorySnapshots a day
  • Once a snapshot is stored, it never changes
  • There are around 80 different properties for each inventory item, defined by an external system - we have meta data about the fields, but they can be added/removed over time, some of which we'd search on, some of which we'd run aggregate queries over... hence why I'm thinking ElasticSearch is a good fit.

Query summary:

  • page though the data (on a specific date, against a baseline), and some other criteria (such as ManufacturerName matching either today or the comparison)
  • There would also be aggregate queries and faceted searches over these data sets (ie group on current manufacturer name, and give me totals for the current quantity and baseline quantity)
  • There's also a hierarchy element to this data, but I think that's one for another day!

Any suggestions or approaches I should be looking at? Thanks!


(system) #2