Sample input json:
Email1 - Paragraph1:
{
“ParagraphId”: “Para1",
“EmailId”: “Email1",
“Content”: “The city is Chennai”,
“Tags”: [
{
“TagId”: “Tag1”,
“Name: “City”,
“Values”: [
{
“Value”: “Chennai”
}
]
}
]
}
Email1 - Paragraph2:
{
“ParagraphId”: “Para2",
“EmailId”: “Email1",
“Content”: “It has around 10000 people living in the city.“,
“Tags”: [
{
“TagId”: “Tag2”,
“Name: “Population”,
“Values”: [
{
“Value”: “10000"
}
]
}
]
}
Email2 - Paragraph1:
{
“ParagraphId”: “Para1",
“EmailId”: “Email2",
“Content”: “The city is Bengaluru”,
“Tags”: [
{
“TagId”: “Tag1”,
“Name: “City”,
“Values”: [
{
“Value”: “Bengaluru”
}
]
}
]
}
Email2 - Paragraph2:
{
“ParagraphId”: “Para2",
“EmailId”: “Email2",
“Content”: “The city’s population is about 5000.“,
“Tags”: [
{
“TagId”: “Tag2”,
“Name: “Population”,
“Values”: [
{
“Value”: “5000"
}
]
}
]
}
Problem statement:
Email - can have multiple paragraphs
Paragraph - can have multiple Tags
Tag - can have multiple Values
Query:
Get the total population of all the cities grouped by city name
Current solution:
We’re using aggegations in Elasticsearch (2 queries) to solve this.
First we’re identifying the Paragraphs in the Emails which has the Tag name “City” linked to it.
In this case, the 1st Paragrah from both the Emails will be filtered and the list of EmailIds under each City bucket.
Then based on the EmailIds retrieved under each bucket, we’ll check for the Paragraphs that has the “Population” Tag linked to it. For those Population paragraphs, the SUM aggregate function will be applied.
Complexity:
The time taken for the queries here depends on the number of Cities present in the Paragraphs and the number of aggregate functions performed.
It takes more time for the query to execute with increasing bucket size.
Questions:
- Is there any way, we can optimise the querying part?
- Is there any schema change needed in the way we’re saving the data here?
- If no possible way to achieve in ES, any other database alternatives?
- It is more challenging to construct a query with multiple group by fields (Country and City tags). Any easy way to handle this?
- If we’re returning the list of values under each bucket (instead of SUM aggregation here), ES seems to have a limit of 10000 values per bucket too. Any solution here too?