I'll speak better in code. That was my initial idea:
Statistic index:
PUT statistic
{
"mappings": {
"statistic": {
"properties": {
"StatisticID": {
"type": "keyword"
},
"IndustryName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"ProductId": {
"type": "integer"
},
"ProductName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"CountryCodeId": {
"type": "integer"
},
"CountryName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"MeasureTypeId": {
"type": "integer"
},
"MeasureType": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"IsPopular": {
"type": "boolean"
},
"ResultTypeId": {
"type": "byte"
},
"Description": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"SortOrder": {
"type": "integer"
}
}
}
}
}
Clients index:
PUT /clients
{
"mappings": {
"statistic": {
"properties": {
"StatisticIDs": {
"type": "keyword"
}
}
}
}
}
Dummy data:
PUT /statistic/statistic/561_12351_213
{
"CountryeId": 213,
"CountryName": "Malta",
"IndustryName": "Fresh Food",
"ProductId": 12351,
"ProductName": "Pork",
"MeasureTypeId": 561,
"MeasureType": "Market Sizes",
"Description": "Aggregated sales in a time series by standard data types, per capita and growth.",
"IsPopular": true,
"SortOrder": 100,
"ResultTypeId": 1,
"StatisticID": "561_12351_213"
}
PUT /clients/statistic/1235143
{
"StatisticIDs": [
"561_12351_213",
"561_12351_214"
/* up to 2,000,000 other possible combinations */
]
}
Actual search query:
GET /statistic/_search
{
"query": {
"bool": {
"filter": {
"terms": {
"StatisticID": {
"index": "clients",
"type": "statistic",
"id": "1235143", <--- this is client_id
"path": "StatisticIDs"
}
}
}
}
}
}
I wanted to have separate indexes because statistic
index is being updated rarely - once in a week, while the client index is updated regularly.
In a relational world it makes sense because you can query this quite efficiently using a simple join between two tables:
SELECT S.*
FROM Elasticsearch.Statistic AS S
INNER JOIN Elasticsearch.Client AS C
ON C.MeasureTypeId = S.MeasureTypeId
AND C.ProductId = S.ProductId
AND C.CountryId = S.CountryId
WHERE C.ClientId = @ClientId;
This would work quite well in Elasticsearch too if the amount of StatisticIDs
in client
index wouldn't be that big.
This also simplifies the update process because whenever there's a change in client index data i can just rewrite the whole document without having to make explicit checks what was updated.
The whole data is right now managed in SQL Server and I'm aiming to use Kafka stack to read data from SQL Server and put into Kafka topics and then push it to Elasticsearch.