Aggregating matched tokens

Pyppe · May 24, 2022, 6:50am

Hi!

Is there any way to aggregate only the "matched" tokens for a particular field? Let's say we have a field that is tokenized on whitespace, and we have documents such as:

"THE PINK PANTHER"
"THE PUNK BAND"

And if we run the regexp query P.NK, we'd like to aggregate only PINK and PUNK (without the THE token rising to the top). Is there a way to achieve this?

Thanks!

vincenbr · May 24, 2022, 9:54am

Hi Pyppe,
I'm not sure what you mean by "matched" tokens. At first sight, I don't see a way to get all the tokens that matched on your text field (as in an inverted index query) and simultaneously aggregate on them (aggregating implies that you have a keyword field or fielddata).
Fielddata would let you aggregate on single tokens, but the problem then is how to filter on which tokens matched and keep only the corresponding buckets.
On the other hand (regexp approach) you could use a runtime field to extract the tokens that match your regexp in a keyword field, and run a terms aggregation on this field, which is pretty straightforward.

Pyppe · May 24, 2022, 10:15am

By matched token I mean if "THE PINK PANTHER" has tokens [THE, PINK, PANTHER], and query being P.NK, the matched token would be PINK. That is, when utilizing a whitespace tokenizer.

"On the other hand (regexp approach) you could use a runtime field to extract the tokens that match your regexp in a keyword field, and run a terms aggregation on this field, which is pretty straightforward."

Haven't ever done something like this. Any tips for relevant documentation about using this runtime field, and running terms aggregation on it?

vincenbr · May 25, 2022, 12:23am

First, you can read the blog post "Getting started with runtime fields, Elastic's implementation of schema on read" (Elastic Blog) as an introduction, especially the first part, where a runtime field is declared inside a _search request. If the runtime field is of type keyword, then it can be used in a terms aggregation, like any keyword field from your mapping.

In the following example, I have an authors field, possibly multi-valued, which can contain values like
"THE PINK PANTHER"
"PUNK BAND"
"STEVE PINK"
I have added "fielddata": true in the mapping for this text field, as an optimization. It lets the script read the field's analyzed tokens through doc['authors'] (served from in-memory fielddata, since text fields have no doc_values), but you can also use data from the _source (for reference: "Map a runtime field", Elasticsearch Guide [8.2]).
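For reference, a mapping along these lines would enable that. This is only a sketch: the index name myindex and the field name authors are taken from the query in this post, and leaving out an analyzer (so the default standard analyzer applies and lowercases the tokens) is an assumption on my side, not something you must copy.

```
# Sketch only: assumes a new index named myindex and the default (standard) analyzer.
# "fielddata": true makes the analyzed tokens of the text field readable from scripts.
PUT myindex
{
  "mappings": {
    "properties": {
      "authors": {
        "type": "text",
        "fielddata": true
      }
    }
  }
}
```

With the standard analyzer the tokens are stored lowercased (pink, punk), which is why the regexp in the script is written in lowercase. If you keep a whitespace analyzer instead, the tokens stay uppercase and the regexp has to match that.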
The following query should do (more or less) what you want:

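GET myindex/_search
{
  "size": 0,
  "runtime_mappings": {
    "toto": {
      "type": "keyword",
      "script": {
        "source": """
        // With fielddata enabled, doc['authors'] holds the analyzed tokens of the field.
        List authors = doc['authors'];
        for (String s : authors) {
          // Lowercase pattern, because the analyzed tokens are lowercased.
          // find() also matches inside longer tokens; use matches() to require a whole token.
          def m = /p.nk/.matcher(s);
          if (m.find()) {
            emit(m.group());
          }
        }
        """
      }
    }
  },
  "aggs": {
    "nom": {
      "terms": {
        "field": "toto",
        "size": 10
      }
    }
  }
}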
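If you would rather not enable fielddata, the _source variant mentioned above could look roughly like the sketch below. The runtime field name matched_token and the aggregation name matched_tokens are made up for illustration, and the script assumes that authors is stored in _source as a string or an array of strings whose tokens are separated by single spaces. Since _source is not analyzed, the values keep their original case, so the pattern is uppercase here.

```
GET myindex/_search
{
  "size": 0,
  "runtime_mappings": {
    "matched_token": {
      "type": "keyword",
      "script": {
        "source": """
        // Assumption: authors is a string or a list of strings in _source.
        def authors = params._source['authors'];
        if (authors != null) {
          if (!(authors instanceof List)) {
            authors = [authors];
          }
          for (def value : authors) {
            // Split on single spaces to mimic a whitespace tokenizer.
            for (String token : value.toString().splitOnToken(' ')) {
              // ==~ requires the whole token to match the pattern.
              if (token ==~ /P.NK/) {
                emit(token);
              }
            }
          }
        }
        """
      }
    }
  },
  "aggs": {
    "matched_tokens": {
      "terms": {
        "field": "matched_token",
        "size": 10
      }
    }
  }
}
```

Reading from _source is usually slower than reading from fielddata, because the source of every matching document has to be loaded and parsed, so it is worth adding a query that narrows down the documents first.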
