How to sub-aggregate an array that contains another field?

veqryn · October 12, 2018, 4:48am

So I have data that looks roughly like this:

{
	"tags": [
		"date=09/26/18",
		"member=Y",
		"type=newsletter",
		"type=call"
	]
}

I am making use of the fields mapping to create two additional fields:
tags.key
tags.val

So for the string "member=Y"
tags.key ends up containing the prefix: "member"
tags.val ends up containing the suffix: "Y"

I would like to perform an aggregation that give me back an aggregation of keys, with each of their values.

Something that would look roughly like this, or similar:

{
	"aggregations": {
		"key_agg": {
			"buckets": [{
				"key": "type",
				"doc_count": 121724,
				"key_tag_agg": {
					"buckets": [{
						"key": "call",
						"doc_count": 59025
					}, {
						"key": "newsletter",
						"doc_count": 58637
					}]
				}
			}]
		}
	}
}

Instead, whenever I try I seem to get every single permutation for each key field:

{
	"aggregations": {
		"key_agg": {
			"buckets": [{
				"key": "member",
				"doc_count": 121724,
				"key_tag_agg": {
					"buckets": [{
						"key": "type=call",
						"doc_count": 59025
					}, {
						"key": "type=newsletter",
						"doc_count": 58637
					}, {
						"key": "tenure=24",
						"doc_count": 508637
					}, {
						"key": "date=9/27/2018",
						"doc_count": 49943
					}, {
						"key": "date=10/03/2018",
						"doc_count": 49413
					}, {
						"key": "member=Y",
						"doc_count": 45549
					}]
				}
			}]
		}
	}
}

You can see above that even though we are aggregating the keys, and that bucket is for key "type", it still gives values such as "member=Y", even though it doesn't start with "type".

I've tried things like:

  "aggs": {
    "categories": {
      "terms": {
        "field": "tags.key"
      },
      "aggs": {
        "categories": {
          "terms": {
            "field": "tags.val"
          }
        }
      }
    }
  }

And also tried scripting:

  "aggs": {
    "categories": {
      "terms": {
        "field": "tags.key"
      },
      "aggs": {
        "categories": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": """
                if (doc['tags'].value.startsWith(doc['tags.key'].value)) {
                  return doc['tags'].value
                }
                return ''
              """
            }
          }
        }
      }
    }
  }

Is there any potential solution for me?
I can use painless scripts, but no other scripting is available.

thanks,
Chris

abdon · October 12, 2018, 11:02am

You can model the tags field as a nested type. Then you can used a nested aggregation to get to what you want.

veqryn · October 13, 2018, 6:04am

Ok, so I tried to get this working. Here is my trial nested type mapping:

"ntags": {
  "type": "nested",
  "dynamic": false,
  "properties": {
    "key": {
      "type": "keyword",
      "similarity": "boolean",
      "normalizer": "key_normalizer"
    },
    "val": {
      "type": "keyword",
      "similarity": "boolean",
      "normalizer": "value_normalizer"
    }
  }
}

The key_normalizer just takes the string "foo=bar" and turns it into "foo". The value_normalizer turns it into "bar".

My data is coming in like:
{"tags":["foo=bar", "blah=blarg"]}

I thought I would try adding a nested field as a second field in the mapping (a multi-field).
Turns out you can't, I get this error:
Type [nested] cannot be used in multi field

Ok, so maybe if I added the nested type field as just another field at the root level, and then use copy_to to copy the tags field over to the ntags.key and ntags.value?
Turns out you can't do that either:
Illegal combination of [copy_to] and [nested] mappings: [copy_to] may only copy data to the current nested document or any of its parents, however one [copy_to] directive is trying to copy data from nested object [null] to [ntags]

Alright, so what if I just replace the tags mapping with a nested type, and try to use a normalizer to convert the raw string into the key and value?
Nope, that doesn't work either, it expects a property block instead of a string:
object mapping for [tags] tried to parse field [null] as object, but found a concrete value

I can't change the way my data is coming in. Is there anything else I can try doing?
Is there some way to script an array of strings into a nested type (either at index time or at search time)?

abdon · October 13, 2018, 5:51pm

You say you cannot change the way the data is coming in. Is there a way for you to specify a pipeline parameter when you index documents? If so, you can make use of an ingest pipeline to pre-process your documents before they are indexed. For example with a script processor that splits your tags into keys and values. The example below shows how that would work:

# Define a pipeline that splits tags on the equals sign
PUT _ingest/pipeline/my-pipeline
{
  "description": "Splits tags on equals sign",
  "processors": [
    {
      "script": {
          "source": """
  ctx.ntags = [];
  
  for (tag in ctx.tags) {
    StringTokenizer str = new StringTokenizer(tag, '=');
    def key = str.nextElement();
    def value = str.nextElement();
    
    ctx.ntags.add(['key': key, 'value': value])
  }
          """
        }
    }
  ]
}

# Test the pipeline with _simulate
POST _ingest/pipeline/my-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "tags": [
          "foo=bar",
          "blah=blarg"
        ]
      }
    }
  ]
}

# Index a document with the pipeline
PUT test/_doc/1?pipeline=my-pipeline
{
  "tags": [
    "foo=bar",
    "blah=blarg"
  ]
}

# Retrieve the document to validate that it works
GET test/_doc/1

If there is no way for you to specify a pipeline parameter on the index requests, you could wait until Elasticsearch 6.5 will be released. That version will introduce the ability to specify a default_pipeline on an index so you won't have to specify it with the index request. 6.5 will be the next minor release, so it should not be too long until that version will be released.

system · November 10, 2018, 5:51pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
SubAggregations with array fields - more results than expected Elasticsearch	1	423	July 5, 2017
_field_names aggregation with sub-aggregation Elasticsearch	3	1082	July 5, 2017
Group Values with Aggregation Elasticsearch	8	1534	October 21, 2019
Aggregation on substring of a specific field Elasticsearch	5	1400	May 6, 2021
Aggregations key with sub array Elasticsearch	1	388	December 31, 2018

How to sub-aggregate an array that contains another field?

Related topics