How to aggregate data across multiple fields?

Hi!

I'm fairly new to Kibana. I'm trying to perform data aggregation across multiple fields, but don't seem to be approaching the problem the right way. Example use case:

I have data about house prices. E.g. house "a" costs 10, house "b" costs 20, house "c" costs 5, house "d" costs 50. Example data point:

{
  "house": "a",
  "cost": 10
}

I also have data about ownership, e.g. person "p1" owns houses ["a", "b"] and person "p2" owns houses ["c", "d"]. Example data point:

{
  "person": "p1",
  "houses": ["a", "b"]
}

Goal: I want to visualize a pie chart that shows the aggregated house cost for each person. I.e. "p1" should get a value of 10 + 20 = 30, and "p2" should get a value of 5 + 50 = 55.

This currently doesn't work with Pie Chart visualization because there's no link between "person" and "cost".

I don't want to pre-calculate this aggregation before pushing the data to Kibana - my hope is that Kibana can aggregate the data on the fly.

  • Can I solve this problem in Kibana, or do I need to pre-compute these values before pushing the data?
  • How can I "link" the data from the 2 different types of datapoints? Is there a better way to lay out the data?

Thanks for the help!

This seems like normalized data. Unfortunately, Elasticsearch doesn't currently support the joins you would normally use to do something like this in a normalized context. Thankfully, you can get pretty far in ES by pre-joining or denormalizing your data at ingestion time.

That is probably not the answer you'd like to hear, but once the data is denormalized, the whole power of the Elasticsearch query DSL is available to build out any aggregation you want.

So the example data you've written out would look a little different denormalized. Since what you've described is a one-to-many relationship - one person owns many houses - each document would be a house, and the same person would appear in multiple rows:

{ "person": "p1", " house": "a",  "price": 10  },
{ "person": "p1", " house": "b",  "price": 20  },
{ "person": "p2", " house": "c",  "price": 5  },
{ "person": "p2", " house": "d",  "price": 50  }

At ingestion time, you'd also have to think through complicated situations, for instance: when one house is owned by multiple people, is the cost of the house shared or pro-rated among the owners?

After the data is denormalized, the aggregation to get the data you're looking for becomes easy and extremely efficient. You would use a terms aggregation, with a sum aggregation inside.
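
For reference, the raw request for that might look something like this (just a sketch - it assumes the denormalized documents above live in an index called "houses" and that person is mapped as a keyword field; with default dynamic mapping you'd aggregate on person.keyword instead):

GET /houses/_search
{
  "size": 0,
  "aggs": {
    "cost_per_person": {
      "terms": { "field": "person" },
      "aggs": {
        "total_cost": { "sum": { "field": "price" } }
      }
    }
  }
}

That same pairing is what a Kibana pie chart builds for you: split the slices with a terms aggregation on person, and size them with a sum of price.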

Thanks for the detailed explanation! I see what you mean now by normalized data.

What would the data look like if there's an N:M relationship between house and person? I.e. multiple people can own the same house.

I suppose it's just about duplicating the data, i.e.

{ "person": "p1", "house": "a", "price": 10},
{ "person": "p2", "house": "a", "price": 10}

Would that make sense? Sounds like the amount of data can grow exponentially if more parameters are introduced though, e.g. size of the house, number of rooms, etc.

What happens when the data has a many-to-many relationship is something you'd have to think through on a case-by-case basis.

Duplicating the data like you've suggested is a good approach in general, but it might not be truthful: if you aggregate house data, all of a sudden house "a" looks like it's worth 20 when it's really worth 10.

If you were to take the proration approach, you'd divide the house price by the number of people who own it, so the data would look like:

{ "person": "p1", "house": "a", "price": 5 },
{ "person": "p2", "house": "a", "price": 5 }

This keeps the price of the house accurate, and might be a better reflection of reality depending on the data. A house likely doesn't double in price as soon as two people share ownership of it.

Unfortunately, the data does grow exponentially with each many-to-many relationship you join into it. That is a downside of the denormalization process, but the examples you've given wouldn't actually create such a data explosion. If you introduced those two fields, the data would look like:

{ "person": "p1", " house": "a",  "price": 10 , "roomCount": 5, "sizeInSqFt": 2500 },
{ "person": "p1", " house": "b",  "price": 20, "roomCount": 2, "sizeInSqFt": 4500 },
{ "person": "p2", " house": "c",  "price": 5, "roomCount": 1, "sizeInSqFt": 100  },
{ "person": "p2", " house": "d",  "price": 50 , "roomCount": 5, "sizeInSqFt": 2000 }

Notice how the number of rows is the same, because the house is already the most granular level of the data. You would only get an explosion of data if you joined in a third data set with a many-to-many relationship to the original tables.

You can think of denormalization as a trade-off: yes, you are creating more documents and duplicating your data, but you are also making it way easier for the computer to search through it and do math on it.

I'd recommend more reading on the subject, as it's honestly fascinating! I can speak from experience that if the denormalization process is done right it can result in an extremely fast and powerful experience in Elasticsearch.
