Aggregate counting/sum query

I'm trying to figure out how to sum up many different counts in an elasticsearch index. A document in the index looks like this:

{
  '_source': {
    'my_field': 'Robert and Alex went with Robert to play in London and then Robert went to London',
    'ner': {
      'persons': {
        'Alex': 1,
        'Robert': 3
      },
      'organizations': {},
      'dates': {},
      'locations': {
        'London': 2
      }
    }
  }
}

How can I sum up all the different words in location, dates and persons in the index? For example if another document had 2 occurrences of Alex, I'd get Alex: 3, Robert: 3, ..

If you have a limited number of persons/organizations/dates/locations, and you know exactly what the values of those entities are going to be, then you can use a sum aggregation to sum the counts for each value individually:

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "sum_alexes": {
      "sum": {
        "field": "ner.persons.Alex"
      }
    },
    "sum_roberts": {
      "sum": {
        "field": "ner.persons.Robert"
      }
    }
  }
}

It's probably not the case that you know all the values beforehand though? If you're dealing with a potentially large number of different values, and you do not know the values beforehand, then it's going to be a bit more work. I would suggest you restructure your documents using nested types.

For example, just taking persons into account, you could create your index with a mapping like this:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "ner": {
          "properties": {
            "persons": {
              "type": "nested",
              "properties": {
                "name": {
                  "type": "keyword"
                },
                "count": {
                  "type": "integer"
                }
              }
            }
          }
        }
      }
    }
  }
}

Now you will need to index your data providing an array of objects that each contain the "name" and "count" fields:

PUT my_index/_doc/1
{
  "my_field": "Robert and Alex went with Robert to play in London and then Robert went to London",
  "ner": {
    "persons": [
      {
        "name": "Alex",
        "count": 1
      },
      {
        "name": "Robert",
        "count": 3
      }
    ]
  }
}

PUT my_index/_doc/2
{
  "my_field": "Alex foo bar Alex",
  "ner": {
    "persons": [
      {
        "name": "Alex",
        "count": 2
      }
    ]
  }
}

You can now use a nested aggregation in combination with a terms aggregation and a sum aggregation to get a list of the for example 10 names with the highest total count:

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "persons": {
      "nested": {
        "path": "ner.persons"
      },
      "aggs": {
        "top_names": {
          "terms": {
            "field": "ner.persons.name",
            "size": 10,
            "order": {
              "total_sum_count": "desc"
            }
          },
          "aggs": {
            "total_sum_count": {
              "sum": {
                "field": "ner.persons.count"
              }
            }
          }
        }
      }
    }
  }
}
2 Likes

This is great. Thank you. However I chose to go with a different approach for this. I noticed that in the Discover tab in Kibana, it shows you the most used words in a field so I changed the structure to be like this:

{
 persons: ['Alex', 'Alex', 'Robert']
}

I realize this is less efficient than you're suggestion but your suggestion is not working too well with Kibana.

Yeah, you're right - working with nested types is not really supported by Kibana.

I don't know what you're actually doing in Kibana, but there is one thing to be aware of with your approach. Even though you have added the word Alex twice to that persons field, if you run a terms aggregation on that field, the term Alex will only be counted once. The terms aggregation counts documents that contain the term - not individual occurrences of the term.

Sorry if this is a little unrelated to my original question, but I'm trying to figure out how to aggregate documents that follow a structure similar to my last comment. I'm trying to count the individual occurrences of the words (in other words, I'm trying to do what you stated in your last sentence), similar to what Kibana does automatically:

image

Also, I don't understand why Kibana doesn't show a visualize button for these fields.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.