Count unique values inside primitive array

UglyBarnacle · September 14, 2020, 3:13pm

Just wanted to know if there is an elegant solution for my problem:

We sort of have an index that stores orders and on those orders is an array of zipcodes which the customer selected.

{
	"identifier": "1574417821835",
	"selectedZipCodes": [	"96106",	"96126",	"96145",	"96148",	"96151",	"96161",	"96166",	"96176",	"96179",	"96182",	"96184",	"96190",	"96250",	"96274",	"97500"	],
	"sortTimestamp": 1575673199999
}

now for visualisation on a map, we try to cluster those zipcodes depending on the zoom-level of the map.
I tried a couple of things with different aggregations and filters for the buckets, cardinality doesn't work, because it always counts all elements of the matched document, not the ones i filtered for (terms: include: regex)

I sure could handle it on the client side, but that means I would have to return all the zipcodes in buckets, e. g. up to 100k in edge-cases (partitioned terms in that case), which results in a huge json-payload.

this is what I got so far:

{
  "size": 0,
  "query":{"bool":{"filter":{"exists":{"field":"selectedZipCodes"}}}},
  "aggs": {
    "zips": {
      "filter": {
        "prefix": {
          "selectedZipCodes": "9"
        }
      },
      "aggs": {
        "9x_count": {
          "filter": {
            "prefix": {
              "selectedZipCodes": "9"
            }
          },
          "aggs": {
            "counts": {
              "terms": {
                "field": "selectedZipCodes",
                "include": "9.*"
              }
            }
          }
        },
        "96x": {
          "filter": {
            "prefix": {
              "selectedZipCodes": "96"
            }
          },
          "aggs": {
            "counts": {
              "terms": {
                "field": "selectedZipCodes",
                "include": "96.*"
              }
            }
          }
        },
        "97x": {
          "filter": {
            "prefix": {
              "selectedZipCodes": "97"
            }
          },
          "aggs": {
            "counts": {
              "terms": {
                "field": "selectedZipCodes",
                "include": "97.*"
              }
            }
          }
        }
      }
    }
  }
}

which just results in overcrowded buckets:

"aggregations": {
    "zips": {
      "doc_count": 211,
      "97x": {
        "doc_count": 198,
        "counts": {
          "doc_count_error_upper_bound": 1,
          "sum_other_doc_count": 211,
          "buckets": [
            {
              "key": "97500",
              "doc_count": 198
            },
            {
              "key": "97486",
              "doc_count": 25
            },
            {
              "key": "97496",
              "doc_count": 25
            },
            {
              "key": "97070",
              "doc_count": 1
            },
            {
              "key": "97072",
              "doc_count": 1
            },
            {
              "key": "97074",
              "doc_count": 1
            },
            {
              "key": "97076",
              "doc_count": 1
            },
            {
              "key": "97078",
              "doc_count": 1
            },
            {
              "key": "97080",
              "doc_count": 1
            },
            {
              "key": "97082",
              "doc_count": 1
            }
          ]
        }
      },
      "96x": {
        "doc_count": 206,
        "counts": {
          "doc_count_error_upper_bound": 1,
          "sum_other_doc_count": 488,
          "buckets": [
            {
              "key": "96106",
              "doc_count": 200
            },
            {
              "key": "96190",
              "doc_count": 200
            },
            {
              "key": "96176",
              "doc_count": 199
            },
            {
              "key": "96161",
              "doc_count": 198
            },
            {
              "key": "96179",
              "doc_count": 198
            },
            {
              "key": "96182",
              "doc_count": 198
            },
            {
              "key": "96184",
              "doc_count": 198
            },
            {
              "key": "96274",
              "doc_count": 198
            },
            {
              "key": "96151",
              "doc_count": 197
            },
            {
              "key": "96166",
              "doc_count": 197
            }
          ]
        }
      },
      "9x_count": {
        "doc_count": 211,
        "counts": {
          "doc_count_error_upper_bound": 4,
          "sum_other_doc_count": 1866,
          "buckets": [
            {
              "key": "96106",
              "doc_count": 200
            },
            {
              "key": "96190",
              "doc_count": 200
            },
            {
              "key": "96176",
              "doc_count": 199
            },
            {
              "key": "96161",
              "doc_count": 198
            },
            {
              "key": "96179",
              "doc_count": 198
            },
            {
              "key": "96182",
              "doc_count": 198
            },
            {
              "key": "96184",
              "doc_count": 198
            },
            {
              "key": "96274",
              "doc_count": 198
            },
            {
              "key": "97500",
              "doc_count": 198
            },
            {
              "key": "96151",
              "doc_count": 197
            }
          ]
        }
      }
    }
  }

maybe someone got an idea or something to kickstart from - or did I miss out the bucket_count_ aggregation that counts the terms without actually needing them? ^^

oh, expected would be something like this (structure omitted):

{
	"9x": {
		"count": 42,
		"96": {
			"count": 13,
			"960": {...}
		},
		"97": {
			"count": 9,
			"976": {...}
		}
		...
	},
	"6x": {...}
}

or different structure, just an idea

system · October 12, 2020, 3:13pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Counting the number of buckets matching certain criteria Elasticsearch	3	8743	July 5, 2017
Counting each key of an array inside aggregation Elasticsearch	5	3046	March 3, 2022
Unique count exact value Elasticsearch	1	623	June 10, 2021
Count arrays as an object Elasticsearch	2	384	July 21, 2023
How to get distinct values from elastic search Elasticsearch	1	570	May 25, 2020

Count unique values inside primitive array

Related topics