Creating a mapping when nested object has too many unique random keys

jb789 · June 1, 2020, 10:11pm

Wondering how to create a mapping for the below document when the nested object key has too many unique / unknown keys in just one document.

Document Link

When run as dynamic mapping, I receive the following error

{
  "took" : 2371,
  "errors" : true,
  "items" : [
    {
      "index" : {
        "_index" : "pricing",
        "_type" : "redshift",
        "_id" : "jvOxcHIBZBHEiq-WwVYW",
        "status" : 400,
        "error" : {
          "type" : "illegal_argument_exception",
          "reason" : "Limit of total fields [1000] in index [pricing] has been exceeded"
        }
      }
    }
  ]
}

How would you create a mapping if the key within the object is unknown?

For example: The products is a nested object, but can have multiple keys for each version of the product.

{
    "formatVersion": "v1.0",
    "disclaimer": "This pricing list is for informational purposes only. All prices are subject to the additional terms included in the pricing pages on http://aws.amazon.com. All Free Tier prices are also subject to the terms included at https://aws.amazon.com/free/",
    "offerCode": "AmazonRedshift",
    "version": "20200511174831",
    "publicationDate": "2020-05-11T17:48:31Z",
    "products": {
        "KNVR5PFC54PXV76M": {
            "sku": "KNVR5PFC54PXV76M",
            "productFamily": "Compute Instance",
            "attributes": {
                "servicecode": "AmazonRedshift",
                "location": "Africa (Cape Town)",
                "locationType": "AWS Region",
                "instanceType": "dc2.large",
                "currentGeneration": "Yes",
                "vcpu": "2",
                "memory": "15 GiB",
                "storage": "0.16TB SSD",
                "io": "0.60 GB/s",
                "usagetype": "AFS1-Node:dc2.large",
                "operation": "RunComputeNode:0001",
                "ecu": "7",
                "servicename": "Amazon Redshift",
                "usageFamily": "Dense Compute"
            }
        },
        "UUE32SQ6PQ9F6JHJ": {
            "sku": "UUE32SQ6PQ9F6JHJ",
            "productFamily": "Compute Instance",
            "attributes": {
                "servicecode": "AmazonRedshift",
                "location": "US East (Ohio)",
                "locationType": "AWS Region",
                "instanceType": "dc1.8xlarge",
                "currentGeneration": "No",
                "vcpu": "32",
                "memory": "244 GiB",
                "storage": "2.56TB SSD",
                "io": "3.70 GB/s",
                "usagetype": "USE2-Node:dw2.8xlarge",
                "operation": "RunComputeNode:0001",
                "ecu": "104",
                "servicename": "Amazon Redshift",
                "usageFamily": "Dense Compute"
            }
        }
    },
    "terms": {
        "OnDemand": {
            "KNVR5PFC54PXV76M": {
                "KNVR5PFC54PXV76M.JRTCKXETXF": {
                    "offerTermCode": "JRTCKXETXF",
                    "sku": "KNVR5PFC54PXV76M",
                    "effectiveDate": "2020-04-01T00:00:00Z",
                    "priceDimensions": {
                        "KNVR5PFC54PXV76M.JRTCKXETXF.6YS6EN2CT7": {
                            "rateCode": "KNVR5PFC54PXV76M.JRTCKXETXF.6YS6EN2CT7",
                            "description": "$0.357 per Redshift Dense Compute Large (DC2.L) Compute Node-hour (or partial hour)",
                            "beginRange": "0",
                            "endRange": "Inf",
                            "unit": "Hrs",
                            "pricePerUnit": {
                                "USD": "0.3570000000"
                            },
                            "appliesTo": []
                        }
                    },
                    "termAttributes": {}
                }
            }
        },
        "Reserved": {
            "U4V6Y3USKUYCB6Q5": {
                "U4V6Y3USKUYCB6Q5.6QCMYABX3D": {
                    "offerTermCode": "6QCMYABX3D",
                    "sku": "U4V6Y3USKUYCB6Q5",
                    "effectiveDate": "2016-05-31T23:59:59Z",
                    "priceDimensions": {
                        "U4V6Y3USKUYCB6Q5.6QCMYABX3D.2TG2D8R56U": {
                            "rateCode": "U4V6Y3USKUYCB6Q5.6QCMYABX3D.2TG2D8R56U",
                            "description": "Upfront Fee",
                            "unit": "Quantity",
                            "pricePerUnit": {
                                "USD": "34800"
                            },
                            "appliesTo": []
                        },
                        "U4V6Y3USKUYCB6Q5.6QCMYABX3D.6YS6EN2CT7": {
                            "rateCode": "U4V6Y3USKUYCB6Q5.6QCMYABX3D.6YS6EN2CT7",
                            "description": "USD 0.0 per Redshift, dw2.8xlarge reserved instance applied",
                            "beginRange": "0",
                            "endRange": "Inf",
                            "unit": "Hrs",
                            "pricePerUnit": {
                                "USD": "0.0000000000"
                            },
                            "appliesTo": []
                        }
                    },
                    "termAttributes": {
                        "LeaseContractLength": "1yr",
                        "OfferingClass": "standard",
                        "PurchaseOption": "All Upfront"
                    }
                }
            }
        }
    }
}

Tried to flatten the object, but still the search results entire document...

GET /amzpricereportnew/_search
{
    "query": {
        "nested": {
            "path": "products",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"products.KNVR5PFC54PXV76M.attributes.instanceType": "dc2.large"}}
                    ]
                }
            }
        }
    }
}

Most documentation for nested objects talks about keys that are known already. In this case, we don't know the keys beforehand besides they being unique. If I were to run the query, the entire document gets returned.

What we are trying to find is the cost of a product from the terms object. Is there any best practices for designing index mappings for these schema which have higher cardinality?

Christian_Dahlqvist · June 1, 2020, 10:48pm

I would recommend changing the structure of the document as it seems the high cardinss as pity fields are already present within the structures under them which means the can be searched for. For the first part change products to be an array and remove the fields named based on the sku value. This will give you documents, flattened or not, that have controlled mappings and can be indexed and searched. Unless you make these changes to remove the dynamic fields you will not be able to index the data.

jb789 · June 1, 2020, 11:21pm

Thanks Chris. I tried to break it down like you mentioned

{
     "products": [
        {
            "sku": "KNVR5PFC54PXV76M",
            "productFamily": "Compute Instance",
            "attributes": {
                "servicecode": "AmazonRedshift",
                "location": "Africa (Cape Town)",
                "locationType": "AWS Region",
                "instanceType": "dc2.large",
                "currentGeneration": "Yes",
                "vcpu": "2",
                "memory": "15 GiB",
                "storage": "0.16TB SSD",
                "io": "0.60 GB/s",
                "usagetype": "AFS1-Node:dc2.large",
                "operation": "RunComputeNode:0001",
                "ecu": "7",
                "servicename": "Amazon Redshift",
                "usageFamily": "Dense Compute"
            }
        },
        {
            "sku": "UUE32SQ6PQ9F6JHJ",
            "productFamily": "Compute Instance",
            "attributes": {
                "servicecode": "AmazonRedshift",
                "location": "US East (Ohio)",
                "locationType": "AWS Region",
                "instanceType": "dc1.8xlarge",
                "currentGeneration": "No",
                "vcpu": "32",
                "memory": "244 GiB",
                "storage": "2.56TB SSD",
                "io": "3.70 GB/s",
                "usagetype": "USE2-Node:dw2.8xlarge",
                "operation": "RunComputeNode:0001",
                "ecu": "104",
                "servicename": "Amazon Redshift",
                "usageFamily": "Dense Compute"
            }
        }
],
 "terms": {
        "OnDemand": [
            {
                "offerTermCode": "JRTCKXETXF",
                "sku": "KNVR5PFC54PXV76M",
                "effectiveDate": "2020-04-01T00:00:00Z",
                "priceDimensions": {
                    "KNVR5PFC54PXV76M.JRTCKXETXF.6YS6EN2CT7": {
                        "rateCode": "KNVR5PFC54PXV76M.JRTCKXETXF.6YS6EN2CT7",
                        "description": "$0.357 per Redshift Dense Compute Large (DC2.L) Compute Node-hour (or partial hour)",
                        "beginRange": "0",
                        "endRange": "Inf",
                        "unit": "Hrs",
                        "pricePerUnit": {
                            "USD": "0.3570000000"
                        },
                        "appliesTo": []
                    }
                },
                "termAttributes": {}
            }
        ]
 }
}

Below is the mapping after indexing the document.

{"awspr":{"mappings":{"properties":{"disclaimer":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"formatVersion":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"offerCode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"products":{"properties":{"attributes":{"properties":{"concurrencyscalingfreeusage":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"currentGeneration":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"description":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"ecu":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"instanceType":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"io":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"location":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"locationType":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"memory":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"operation":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"servicecode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"servicename":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"storage":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"usageFamily":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"usagetype":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"vcpu":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"productFamily":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"sku":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"publicationDate":{"type":"date"},"terms":{"properties":{"OnDemand":{"properties":{"effectiveDate":{"type":"date"},"offerTermCode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"priceDimensions":{"properties":{"KNVR5PFC54PXV76M":{"properties":{"JRTCKXETXF":{"properties":{"6YS6EN2CT7":{"properties":{"beginRange":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"description":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"endRange":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"pricePerUnit":{"properties":{"USD":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"rateCode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"unit":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}}}}}}},"sku":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"termAttributes":{"type":"object"}}},"Reserved":{"properties":{"effectiveDate":{"type":"date"},"offerTermCode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"priceDimensions":{"properties":{"U4V6Y3USKUYCB6Q5":{"properties":{"6QCMYABX3D":{"properties":{"2TG2D8R56U":{"properties":{"description":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"pricePerUnit":{"properties":{"USD":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"rateCode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"unit":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"6YS6EN2CT7":{"properties":{"beginRange":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"description":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"endRange":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"pricePerUnit":{"properties":{"USD":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"rateCode":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"unit":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}}}}}}},"sku":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"termAttributes":{"properties":{"LeaseContractLength":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"OfferingClass":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"PurchaseOption":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}}}}},"version":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}}}

Still the query yields all sku if I were to search by a specific instanceType from products. I indexed it directly from a file where the entire document is in 1 line... like a bulk upload... I see it kind of treats the entire product array as 1 document. Should I process it via logstash, loop over the array and index?

Christian_Dahlqvist · June 2, 2020, 5:12am

Do the same for priceDimensions to remove all dynamic field names. Split it not different documents depending on how you want to query them and return results. You can do this in Logstash using a split filter.

jb789 · June 4, 2020, 3:01pm

I had doubts if there is some feature that ES has which can still index the data as is because I am noob to ES, I wasn't sure if I was missing something... Your inputs suggest that the data massaging route is the way to go. I will stick to it. Thanks Christian...

system · July 2, 2020, 3:01pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Mapping for nested object with dynamic keys, uniform values Elasticsearch	4	2574	July 6, 2017
Need help mapping dynamic keys Elasticsearch	1	329	June 26, 2019
Unknown key [productid] for create index Elasticsearch	2	3117	July 8, 2020
Getting Unique Set of nested objects within aggregations Elasticsearch	2	637	May 8, 2022
Dynamic Nested Objects Mapping Elasticsearch	6	1993	April 3, 2018

Creating a mapping when nested object has too many unique random keys

Related topics