How to sub-aggregate an array that contains another field?

So I have data that looks roughly like this:

{
	"tags": [
		"date=09/26/18",
		"member=Y",
		"type=newsletter",
		"type=call"
	]
}

I am making use of the fields mapping to create two additional fields:
tags.key
tags.val

So for the string "member=Y"
tags.key ends up containing the prefix: "member"
tags.val ends up containing the suffix: "Y"

I would like to perform an aggregation that give me back an aggregation of keys, with each of their values.

Something that would look roughly like this, or similar:

{
	"aggregations": {
		"key_agg": {
			"buckets": [{
				"key": "type",
				"doc_count": 121724,
				"key_tag_agg": {
					"buckets": [{
						"key": "call",
						"doc_count": 59025
					}, {
						"key": "newsletter",
						"doc_count": 58637
					}]
				}
			}]
		}
	}
}

Instead, whenever I try I seem to get every single permutation for each key field:

{
	"aggregations": {
		"key_agg": {
			"buckets": [{
				"key": "member",
				"doc_count": 121724,
				"key_tag_agg": {
					"buckets": [{
						"key": "type=call",
						"doc_count": 59025
					}, {
						"key": "type=newsletter",
						"doc_count": 58637
					}, {
						"key": "tenure=24",
						"doc_count": 508637
					}, {
						"key": "date=9/27/2018",
						"doc_count": 49943
					}, {
						"key": "date=10/03/2018",
						"doc_count": 49413
					}, {
						"key": "member=Y",
						"doc_count": 45549
					}]
				}
			}]
		}
	}
}

You can see above that even though we are aggregating the keys, and that bucket is for key "type", it still gives values such as "member=Y", even though it doesn't start with "type".

I've tried things like:

  "aggs": {
    "categories": {
      "terms": {
        "field": "tags.key"
      },
      "aggs": {
        "categories": {
          "terms": {
            "field": "tags.val"
          }
        }
      }
    }
  }

And also tried scripting:

  "aggs": {
    "categories": {
      "terms": {
        "field": "tags.key"
      },
      "aggs": {
        "categories": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": """
                if (doc['tags'].value.startsWith(doc['tags.key'].value)) {
                  return doc['tags'].value
                }
                return ''
              """
            }
          }
        }
      }
    }
  }

Is there any potential solution for me?
I can use painless scripts, but no other scripting is available.

thanks,
Chris

You can model the tags field as a nested type. Then you can used a nested aggregation to get to what you want.

Ok, so I tried to get this working. Here is my trial nested type mapping:

"ntags": {
  "type": "nested",
  "dynamic": false,
  "properties": {
    "key": {
      "type": "keyword",
      "similarity": "boolean",
      "normalizer": "key_normalizer"
    },
    "val": {
      "type": "keyword",
      "similarity": "boolean",
      "normalizer": "value_normalizer"
    }
  }
}

The key_normalizer just takes the string "foo=bar" and turns it into "foo". The value_normalizer turns it into "bar".

My data is coming in like:
{"tags":["foo=bar", "blah=blarg"]}

I thought I would try adding a nested field as a second field in the mapping (a multi-field).
Turns out you can't, I get this error:
Type [nested] cannot be used in multi field

Ok, so maybe if I added the nested type field as just another field at the root level, and then use copy_to to copy the tags field over to the ntags.key and ntags.value?
Turns out you can't do that either:
Illegal combination of [copy_to] and [nested] mappings: [copy_to] may only copy data to the current nested document or any of its parents, however one [copy_to] directive is trying to copy data from nested object [null] to [ntags]

Alright, so what if I just replace the tags mapping with a nested type, and try to use a normalizer to convert the raw string into the key and value?
Nope, that doesn't work either, it expects a property block instead of a string:
object mapping for [tags] tried to parse field [null] as object, but found a concrete value

I can't change the way my data is coming in. Is there anything else I can try doing?
Is there some way to script an array of strings into a nested type (either at index time or at search time)?

You say you cannot change the way the data is coming in. Is there a way for you to specify a pipeline parameter when you index documents? If so, you can make use of an ingest pipeline to pre-process your documents before they are indexed. For example with a script processor that splits your tags into keys and values. The example below shows how that would work:

# Define a pipeline that splits tags on the equals sign
PUT _ingest/pipeline/my-pipeline
{
  "description": "Splits tags on equals sign",
  "processors": [
    {
      "script": {
          "source": """
  ctx.ntags = [];
  
  for (tag in ctx.tags) {
    StringTokenizer str = new StringTokenizer(tag, '=');
    def key = str.nextElement();
    def value = str.nextElement();
    
    ctx.ntags.add(['key': key, 'value': value])
  }
          """
        }
    }
  ]
}

# Test the pipeline with _simulate
POST _ingest/pipeline/my-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "tags": [
          "foo=bar",
          "blah=blarg"
        ]
      }
    }
  ]
}

# Index a document with the pipeline
PUT test/_doc/1?pipeline=my-pipeline
{
  "tags": [
    "foo=bar",
    "blah=blarg"
  ]
}

# Retrieve the document to validate that it works
GET test/_doc/1

If there is no way for you to specify a pipeline parameter on the index requests, you could wait until Elasticsearch 6.5 will be released. That version will introduce the ability to specify a default_pipeline on an index so you won't have to specify it with the index request. 6.5 will be the next minor release, so it should not be too long until that version will be released.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.