Creating a mapping when nested object has too many unique random keys

Wondering how to create a mapping for the below document when the nested object key has too many unique / unknown keys in just one document.

Document Link

When run as dynamic mapping, I receive the following error

{
  "took" : 2371,
  "errors" : true,
  "items" : [
    {
      "index" : {
        "_index" : "pricing",
        "_type" : "redshift",
        "_id" : "jvOxcHIBZBHEiq-WwVYW",
        "status" : 400,
        "error" : {
          "type" : "illegal_argument_exception",
          "reason" : "Limit of total fields [1000] in index [pricing] has been exceeded"
        }
      }
    }
  ]
}

How would you create a mapping if the key within the object is unknown?

For example: The products is a nested object, but can have multiple keys for each version of the product.

{
    "formatVersion": "v1.0",
    "disclaimer": "This pricing list is for informational purposes only. All prices are subject to the additional terms included in the pricing pages on http://aws.amazon.com. All Free Tier prices are also subject to the terms included at https://aws.amazon.com/free/",
    "offerCode": "AmazonRedshift",
    "version": "20200511174831",
    "publicationDate": "2020-05-11T17:48:31Z",
    "products": {
        "KNVR5PFC54PXV76M": {
            "sku": "KNVR5PFC54PXV76M",
            "productFamily": "Compute Instance",
            "attributes": {
                "servicecode": "AmazonRedshift",
                "location": "Africa (Cape Town)",
                "locationType": "AWS Region",
                "instanceType": "dc2.large",
                "currentGeneration": "Yes",
                "vcpu": "2",
                "memory": "15 GiB",
                "storage": "0.16TB SSD",
                "io": "0.60 GB/s",
                "usagetype": "AFS1-Node:dc2.large",
                "operation": "RunComputeNode:0001",
                "ecu": "7",
                "servicename": "Amazon Redshift",
                "usageFamily": "Dense Compute"
            }
        },
        "UUE32SQ6PQ9F6JHJ": {
            "sku": "UUE32SQ6PQ9F6JHJ",
            "productFamily": "Compute Instance",
            "attributes": {
                "servicecode": "AmazonRedshift",
                "location": "US East (Ohio)",
                "locationType": "AWS Region",
                "instanceType": "dc1.8xlarge",
                "currentGeneration": "No",
                "vcpu": "32",
                "memory": "244 GiB",
                "storage": "2.56TB SSD",
                "io": "3.70 GB/s",
                "usagetype": "USE2-Node:dw2.8xlarge",
                "operation": "RunComputeNode:0001",
                "ecu": "104",
                "servicename": "Amazon Redshift",
                "usageFamily": "Dense Compute"
            }
        }
    },
    "terms": {
        "OnDemand": {
            "KNVR5PFC54PXV76M": {
                "KNVR5PFC54PXV76M.JRTCKXETXF": {
                    "offerTermCode": "JRTCKXETXF",
                    "sku": "KNVR5PFC54PXV76M",
                    "effectiveDate": "2020-04-01T00:00:00Z",
                    "priceDimensions": {
                        "KNVR5PFC54PXV76M.JRTCKXETXF.6YS6EN2CT7": {
                            "rateCode": "KNVR5PFC54PXV76M.JRTCKXETXF.6YS6EN2CT7",
                            "description": "$0.357 per Redshift Dense Compute Large (DC2.L) Compute Node-hour (or partial hour)",
                            "beginRange": "0",
                            "endRange": "Inf",
                            "unit": "Hrs",
                            "pricePerUnit": {
                                "USD": "0.3570000000"
                            },
                            "appliesTo": []
                        }
                    },
                    "termAttributes": {}
                }
            }
        },
        "Reserved": {
            "U4V6Y3USKUYCB6Q5": {
                "U4V6Y3USKUYCB6Q5.6QCMYABX3D": {
                    "offerTermCode": "6QCMYABX3D",
                    "sku": "U4V6Y3USKUYCB6Q5",
                    "effectiveDate": "2016-05-31T23:59:59Z",
                    "priceDimensions": {
                        "U4V6Y3USKUYCB6Q5.6QCMYABX3D.2TG2D8R56U": {
                            "rateCode": "U4V6Y3USKUYCB6Q5.6QCMYABX3D.2TG2D8R56U",
                            "description": "Upfront Fee",
                            "unit": "Quantity",
                            "pricePerUnit": {
                                "USD": "34800"
                            },
                            "appliesTo": []
                        },
                        "U4V6Y3USKUYCB6Q5.6QCMYABX3D.6YS6EN2CT7": {
                            "rateCode": "U4V6Y3USKUYCB6Q5.6QCMYABX3D.6YS6EN2CT7",
                            "description": "USD 0.0 per Redshift, dw2.8xlarge reserved instance applied",
                            "beginRange": "0",
                            "endRange": "Inf",
                            "unit": "Hrs",
                            "pricePerUnit": {
                                "USD": "0.0000000000"
                            },
                            "appliesTo": []
                        }
                    },
                    "termAttributes": {
                        "LeaseContractLength": "1yr",
                        "OfferingClass": "standard",
                        "PurchaseOption": "All Upfront"
                    }
                }
            }
        }
    }
}

Tried to flatten the object, but still the search results entire document...

GET /amzpricereportnew/_search
{
    "query": {
        "nested": {
            "path": "products",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"products.KNVR5PFC54PXV76M.attributes.instanceType": "dc2.large"}}
                    ]
                }
            }
        }
    }
}

Most documentation for nested objects talks about keys that are known already. In this case, we don't know the keys beforehand besides they being unique. If I were to run the query, the entire document gets returned.

What we are trying to find is the cost of a product from the terms object. Is there any best practices for designing index mappings for these schema which have higher cardinality?

I would recommend changing the structure of the document as it seems the high cardinss as pity fields are already present within the structures under them which means the can be searched for. For the first part change products to be an array and remove the fields named based on the sku value. This will give you documents, flattened or not, that have controlled mappings and can be indexed and searched. Unless you make these changes to remove the dynamic fields you will not be able to index the data.

Thanks Chris. I tried to break it down like you mentioned

{
     "products": [
        {
            "sku": "KNVR5PFC54PXV76M",
            "productFamily": "Compute Instance",
            "attributes": {
                "servicecode": "AmazonRedshift",
                "location": "Africa (Cape Town)",
                "locationType": "AWS Region",
                "instanceType": "dc2.large",
                "currentGeneration": "Yes",
                "vcpu": "2",
                "memory": "15 GiB",
                "storage": "0.16TB SSD",
                "io": "0.60 GB/s",
                "usagetype": "AFS1-Node:dc2.large",
                "operation": "RunComputeNode:0001",
                "ecu": "7",
                "servicename": "Amazon Redshift",
                "usageFamily": "Dense Compute"
            }
        },
        {
            "sku": "UUE32SQ6PQ9F6JHJ",
            "productFamily": "Compute Instance",
            "attributes": {
                "servicecode": "AmazonRedshift",
                "location": "US East (Ohio)",
                "locationType": "AWS Region",
                "instanceType": "dc1.8xlarge",
                "currentGeneration": "No",
                "vcpu": "32",
                "memory": "244 GiB",
                "storage": "2.56TB SSD",
                "io": "3.70 GB/s",
                "usagetype": "USE2-Node:dw2.8xlarge",
                "operation": "RunComputeNode:0001",
                "ecu": "104",
                "servicename": "Amazon Redshift",
                "usageFamily": "Dense Compute"
            }
        }
],
 "terms": {
        "OnDemand": [
            {
                "offerTermCode": "JRTCKXETXF",
                "sku": "KNVR5PFC54PXV76M",
                "effectiveDate": "2020-04-01T00:00:00Z",
                "priceDimensions": {
                    "KNVR5PFC54PXV76M.JRTCKXETXF.6YS6EN2CT7": {
                        "rateCode": "KNVR5PFC54PXV76M.JRTCKXETXF.6YS6EN2CT7",
                        "description": "$0.357 per Redshift Dense Compute Large (DC2.L) Compute Node-hour (or partial hour)",
                        "beginRange": "0",
                        "endRange": "Inf",
                        "unit": "Hrs",
                        "pricePerUnit": {
                            "USD": "0.3570000000"
                        },
                        "appliesTo": []
                    }
                },
                "termAttributes": {}
            }
        ]
 }
}

Below is the mapping after indexing the document.

{"awspr":{"mappings":{"properties":{"disclaimer":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"formatVersion":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"offerCode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"products":{"properties":{"attributes":{"properties":{"concurrencyscalingfreeusage":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"currentGeneration":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"description":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"ecu":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"instanceType":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"io":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"location":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"locationType":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"memory":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"operation":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"servicecode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"servicename":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"storage":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"usageFamily":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"usagetype":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"vcpu":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"productFamily":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"sku":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"publicationDate":{"type":"date"},"terms":{"properties":{"OnDemand":{"properties":{"effectiveDate":{"type":"date"},"offerTermCode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"priceDimensions":{"properties":{"KNVR5PFC54PXV76M":{"properties":{"JRTCKXETXF":{"properties":{"6YS6EN2CT7":{"properties":{"beginRange":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"description":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"endRange":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"pricePerUnit":{"properties":{"USD":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"rateCode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"unit":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}}}}}}},"sku":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"termAttributes":{"type":"object"}}},"Reserved":{"properties":{"effectiveDate":{"type":"date"},"offerTermCode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"priceDimensions":{"properties":{"U4V6Y3USKUYCB6Q5":{"properties":{"6QCMYABX3D":{"properties":{"2TG2D8R56U":{"properties":{"description":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"pricePerUnit":{"properties":{"USD":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"rateCode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"unit":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"6YS6EN2CT7":{"properties":{"beginRange":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"description":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"endRange":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"pricePerUnit":{"properties":{"USD":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"rateCode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"unit":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}}}}}}},"sku":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"termAttributes":{"properties":{"LeaseContractLength":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"OfferingClass":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"PurchaseOption":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}}}}},"version":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}}}

Still the query yields all sku if I were to search by a specific instanceType from products. I indexed it directly from a file where the entire document is in 1 line... like a bulk upload... I see it kind of treats the entire product array as 1 document. Should I process it via logstash, loop over the array and index?

Do the same for priceDimensions to remove all dynamic field names. Split it not different documents depending on how you want to query them and return results. You can do this in Logstash using a split filter.

I had doubts if there is some feature that ES has which can still index the data as is because I am noob to ES, I wasn't sure if I was missing something... Your inputs suggest that the data massaging route is the way to go. I will stick to it. Thanks Christian...

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.