We're looking for guidance on any pitfalls in terms of scale / performance we may run into when using the nested field type in Elasticsearch. We have a use-case where we have a 100k+ different customer schemas that we are working with, some of which may have 1m+ documents associated with them, and we would like to support a wide range of customer queries on these datasets.
For example we may have a customer that stores contracts with us and their contracts have a CounterParty (string), Amount (number), Expiration (date). We'd like to support queries such as "contracts with Amount > 100000 and ExpirationDate > 2020-01-01 ordered by CounterParty".
We'd love for these queries to stay interactive (i.e. < 1s from 10k's to to 100's of million of docs). We've done some internal testing, but I wanted to post this to help us discover any risks that we missed.
We were planning on using the following data model to support the large numbers of schemas without causing mapping explosion:
{
"id": "12345",
"parent": "item_4567",
"stringFields": [
{
"typeKey": "contracts",
"fieldKey": "counterParty",
"value": "ACME Inc."
}, ...
],
"numberFields": [
{
"typeKey": "contracts",
"fieldKey": "amount",
"value": 100001
}, ...
],
"dateFields": [
{
"typeKey": "contracts",
"fieldKey": "expiration",
"value": "2020-01-24"
}, ...
],
}
The mappings would be:
{
"mappings": {
"_doc": {
"properties": {
"stringFields": {
"type": "nested",
"properties": {
"type": "string"
}
},
"numberFields": {
"type": "nested",
"properties": {
"type": "long"
}
},
"dateFields": {
"type": "nested",
"properties": {
"type": "date"
}
},
...
}
}
}
}
And the queries would look like:
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "numberFields",
"query": {
"bool": {
"must": [
{ "match": { "numberFields.typeKey": "contract" } },
{ "match": { "numberFields.fieldKey": "amount" } },
{ "range": { "numberFields.value": { "gte": 100000 } } }
]
}
}
}
},
{
"nested": {
"path": "dateFields",
"query": {
"bool": {
"must": [
{ "match": { "numberFields.typeKey": "contract" } },
{ "match": { "numberFields.fieldKey": "expiration" } },
{ "range": { "numberFields.value": { "gte": "2020-01-01" } } }
]
}
}
}
}
]
}
},
"sort": [
{
"stringFields.value" : {
"order" : "asc",
"nested": {
"path": "stringFields",
"filter": {
"bool": {
"must": [
{ "match": { "stringFields.typeKey": "contract" } },
{ "match": { "numberFields.fieldKey": "counterParty" } }
]
}
}
}
}
}
]
}
Any guidance you could provide about things we would need to look out for in a model like this... Especially around sorting performance at scale, pagination on those results, notable feature limitations would be super helpful and we'd be deeply appreciative.