Need help to aggregate a text field into a visualization

Hi All, I've spent a ton of time on this and I feel like it should be relatively simple, but for some reason I'm not able to find the solution. I'll also add that I'm fairly new to visualizations. My documents look similar to:

{
  "content": "tree tree tree bush field chicken",
  "words": {
    "text": "tree",
    "text": "tree",
    "text": "tree",
    "text": "bush",
    "text": "field",
    "text": "chicken"
  }
}

In the resulting query I am looking for a result like:

tree: 3
bush: 1
field: 1
chicken: 1

However, when I run a count across the words.text field, I consistently get:
tree: 1
bush: 1
field: 1
chicken: 1

@trubisky welcome to the community! You posted something that is not valid JSON, so I am assuming that what you meant is that you have a document like this. Please confirm:

{ "content": "...", "words": [{ "text": "tree" }, ...array of documents] }

When Elasticsearch returns the count of that document, it is always going to be 1 because it's 1 document. However, there are some options for you:

  1. You can add a new top-level field to each document using scripting, so that the count can be calculated with the Sum aggregation. There are three ways to do this:
    a. You can pre-extract the counts before sending the data to ES
    b. You can set up an Elasticsearch ingest pipeline to automatically run a script on every new document
    c. You can create a Kibana runtime field (7.12 and newer) or Kibana scripted field (all versions) using a Painless script to calculate the counts

  2. You can change the structure of your documents entirely, so that each word has its own document. This would let Elasticsearch count them all separately.
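As an illustration of option 1a, here is a minimal client-side sketch in Python (the `extract_word_counts` helper and the `word_counts` field name are made up for illustration, not part of any ES API):

```python
from collections import Counter

def extract_word_counts(words):
    """Collapse a list of repeated words into {word: count} pairs."""
    return dict(Counter(words))

doc_words = ["tree", "tree", "tree", "bush", "field", "chicken"]
counts = extract_word_counts(doc_words)
# counts == {"tree": 3, "bush": 1, "field": 1, "chicken": 1}

# One possible document shape to index, so each entry can be summed later:
doc = {
    "content": " ".join(doc_words),
    "word_counts": [{"text": w, "count": c} for w, c in counts.items()],
}
```

You would then send `doc` to Elasticsearch instead of the raw word list, and a Sum aggregation over the count values becomes possible.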

@wylie thank you for the response and for catching my copy-paste error (what I sent isn't the actual doc; I trimmed the fat down to just what's relevant). I understand your approach of scripting the input to transform it to match what I need in the report, but I'm not quite sure what the object should look like in the index. I think you are suggesting that I extract the counts so I'll have something like "tree": 3, "bush": 1, "field": 1, "chicken": 1. But how can I store that in ES in a manner that lets me build a data table displaying those values?

Sorry, I think I may have been too focused on telling you to restructure your data earlier. It is a pretty common problem.

Elasticsearch is really good at counting frequencies of things and surfacing the most frequent ones using the Terms aggregation. We have a Kibana sample data set that resembles your use case: it tracks ecommerce sales, where each sale has an array of products sold, and we can use the Terms aggregation on that array field to find the most frequently sold products, with counts that are higher than the overall number of transactions.

So basically, your original example had a fine structure. You could even simplify it more to words: [tree, tree, tree, bush, field, chicken] and Elasticsearch will index them the same way.
If you then use a Terms aggregation on the words or words.text array, the Count of each term will represent the total usage. So you'd get buckets: [{ key: tree, doc_count: 3 }, ...] from ES in this case.

The biggest caveat here is that Elasticsearch treats all your frequencies separately. So if you have words.text and words.other_info in the same object, there is no correlation between those two.

To verify what I'm saying, I recommend testing out the ecommerce sample data. Here is a sample query:

POST kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "most_frequent_categories": {
      "terms": {
        "field": "category.keyword"
      }
    },
    "total_from_buckets": {
      "sum_bucket": {
        "buckets_path": "most_frequent_categories>_count"
      }
    }
  }
}

So I just tried an example and simply posted a doc like:

{
  "_index": "transcriptions",
  "_type": "_doc",
  "_id": "3",
  "_version": 1,
  "_score": 1,
  "_source": {
    "terms": [
      "tree",
      "tree",
      "tree",
      "bush",
      "field",
      "chicken"
    ]
  }
}

Then ran this query:

POST transcriptions/_search
{
  "size": 0,
  "aggs": {
    "most_frequent_categories": {
      "terms": {
        "field": "terms.keyword"
      }
    },
    "total_from_buckets": {
      "sum_bucket": {
        "buckets_path": "most_frequent_categories>_count"
      }
    }
  }
}

And got:

  "buckets" : [
    {
      "key" : "bush",
      "doc_count" : 1
    },
    {
      "key" : "chicken",
      "doc_count" : 1
    },
    {
      "key" : "field",
      "doc_count" : 1
    },
    {
      "key" : "tree",
      "doc_count" : 1
    }
  ]

whereas I am looking for { "key" : "tree", "doc_count" : 3 }

I even tried with the sample ecommerce data: I can see that a sale's products array might list 2 products under "Women's Clothing", yet the above query will only count the "Women's Clothing" category once (which does in fact make sense in the context of that dataset).

Okay. So assuming that you don't have a full dictionary before starting, you will need to structure your data so that every document contains exactly one word. For example, you could assign each word a unique ID, something like this:

PUT words/_doc/word-tree
{
  "word": "tree",
  "count": 3
}

and then when you need to update it, you can either run a new PUT request to replace the document, or, if you need concurrency safety, use a scripted update:

POST words/_update_by_query
{
  "query": { "term": { "word": "tree" } },
  "script": {
    "source": "ctx._source.count = ctx._source.count + params.increment",
    "params": {
      "increment": 2
    }
  }
}
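A sketch of one alternative, in case you'd rather not issue a separate PUT the first time a word appears: the Update API with an upsert creates the document when it's missing and runs the script otherwise (this assumes the words index and word-tree ID from the example above):

```
POST words/_update/word-tree
{
  "script": {
    "source": "ctx._source.count += params.increment",
    "params": { "increment": 2 }
  },
  "upsert": {
    "word": "tree",
    "count": 2
  }
}
```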

Given the structure above, you can run a regular Terms aggregation on the word field, and you can get the total number of instances by running the Sum aggregation on the count field.
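With that one-doc-per-word structure, the report query might look like the following sketch (assuming default dynamic mappings, where the word string field gets a word.keyword sub-field for exact-value aggregation):

```
POST words/_search
{
  "size": 0,
  "aggs": {
    "words": {
      "terms": { "field": "word.keyword" },
      "aggs": {
        "total_usage": { "sum": { "field": "count" } }
      }
    }
  }
}
```

Each terms bucket would then carry a total_usage value holding the summed occurrences for that word, which is exactly the shape a Kibana data table can consume.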