Question: Are we going to run into problems using (abusing?) nested at scale

We're looking for guidance on any pitfalls in terms of scale / performance we may run into when using the nested field type in Elasticsearch. We have a use-case where we have a 100k+ different customer schemas that we are working with, some of which may have 1m+ documents associated with them, and we would like to support a wide range of customer queries on these datasets.

For example we may have a customer that stores contracts with us and their contracts have a CounterParty (string), Amount (number), Expiration (date). We'd like to support queries such as "contracts with Amount > 100000 and ExpirationDate > 2020-01-01 ordered by CounterParty".

We'd love for these queries to stay interactive (i.e. < 1s from 10k's to to 100's of million of docs). We've done some internal testing, but I wanted to post this to help us discover any risks that we missed.

We were planning on using the following data model to support the large numbers of schemas without causing mapping explosion:

{
    "id": "12345",
    "parent": "item_4567",
    "stringFields": [
        {
            "typeKey": "contracts",
            "fieldKey": "counterParty",
            "value": "ACME Inc."
        }, ...
    ],
    "numberFields": [
        {
            "typeKey": "contracts",
            "fieldKey": "amount",
            "value": 100001
        }, ...
    ],
    "dateFields": [
        {
            "typeKey": "contracts",
            "fieldKey": "expiration",
            "value": "2020-01-24"
        }, ...
    ],
}

The mappings would be:

{
  "mappings": {
    "_doc": {
      "properties": {
        "stringFields": {
          "type": "nested",
          "properties": {
              "type": "string" 
          }
        },
        "numberFields": {
          "type": "nested",
          "properties": {
              "type": "long"
          }
        },
        "dateFields": {
          "type": "nested",
          "properties": {
              "type": "date"
          }
        },
        ...
      }
    }
  }
}

And the queries would look like:

{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "numberFields",
            "query": {
              "bool": {
                "must": [
                  { "match": { "numberFields.typeKey": "contract" } },
                  { "match": { "numberFields.fieldKey": "amount" } },
                  { "range": { "numberFields.value": { "gte": 100000 } } }
                ]
              }
            }
          }
        },
        {
          "nested": {
            "path": "dateFields",
            "query": {
              "bool": {
                "must": [
                  { "match": { "numberFields.typeKey": "contract" } },
                  { "match": { "numberFields.fieldKey": "expiration" } },
                  { "range": { "numberFields.value": { "gte": "2020-01-01" } } }
                ]
              }
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "stringFields.value" : {
        "order" : "asc",
        "nested": {
          "path": "stringFields",
          "filter": {
            "bool": {
              "must": [
                { "match": { "stringFields.typeKey": "contract" } },
                { "match": { "numberFields.fieldKey": "counterParty" } }
              ]
            }
          }
        }
      }
    }
  ]
}

Any guidance you could provide about things we would need to look out for in a model like this... Especially around sorting performance at scale, pagination on those results, notable feature limitations would be super helpful and we'd be deeply appreciative.

Hi John,

I have been doing some research on this, and what I have found is Nested fields will only really benefit you if you expect high thousands of fields per index. like > 3000, and even then there's sometimes still better choices, it's not a preferred route as a first choice. We generally recommend flattened schemas first.

How many fields per customer, and how many customers do you expect? I see the 100k schemas, how many fields per schema?

An interesting option for you to check out, is to use dynamic mappings, and you could enforce a schema to some degree. If you had at least standard prefixes for your field types you could dynamically assign a type using this method: [Dynamic templates | Elasticsearch Guide [7.15] | Elastic]

so a sample doc... instead of:

{
    "id": "12345",
    "parent": "item_4567",
    "stringFields": [
        {
            "typeKey": "contracts",
            "fieldKey": "counterParty",
            "value": "ACME Inc."`
        }, ...
    ],
    "numberFields": [
        {
            "typeKey": "contracts",
            "fieldKey": "amount",
            "value": 100001
        }, ...
    ],
    "dateFields": [
        {
            "typeKey": "contracts",
            "fieldKey": "expiration",
            "value": "2020-01-24"
        }, ...
    ],
}

it would be like this (or however your structure aligns with use case):

{
    "id": "12345",`
   "ItemNumber_ACMEInc": "item_4567",
   "contractAmount_ACMEInc": 100001,`
    "contractExpiration_ACMEInc": "2020-01-24"`
}

the prefix defines the data type, the suffix, is up to the customer, and it's just an optional key

Hi Charles. Thanks so much for the reply.

We are in fact expecting 100k's of fields per index. We expect 100's of fields per customer and ~100k's customers. While any individual document will probably have a less than 50 of those fields. We had been planning on using a single multi-tenant ES cluster/index. Which is why we turned to nested to avoid mapping explosion over the dynamic fields approaches.

I'm curious what problems you think we'll hit with nested if we continue down this path. We have some lucene experts (in the Solr space) in house and they felt like this nested approach would probably be workable from a lucene standpoint, but I'm curious if there are ES landmines that we might trigger.

OK, I checked some more, the gist of feedback was "if it’s the only way to handle mappings explosion I guess so.."

What this means is, it might be a case where Nested could work for you. The thing to be aware of here is nested causes fixed_bitset_cache usage so at huge document scale all the heap will be used by tracking whats the main doc and the nested special docs, for each nested object you need one in the doc

Lastly a rough idea, is as long as they stay under billions of docs you might make it...

However, be aware the feedback I got was this lack of schema really hurts when you have to be ready to handle anything.

Most likely it’s not going to be fast and it’s going to not compress well on disk.

It would be much better to try to force a schema, rather than allow this kind of schema explosion.

My assumption was that there'd be some amount of schema under the covers to assist with compression and performance (e.g. dateFields_typeKey, dateFields_fieldKey, dateFields_value) would be the 3 underlying fields tied to a doc_id and parent doc_id under the covers to allow for fairly scalable filters, intersections, and sorts) over these seemingly unstructured values.

Is that actually the way this works or does it do something with worse latency @ scale?