Aggregating matched tokens

Hi!

Is there any way to aggregate only the "matched" tokens for a particular field? Let's say we have a field that is tokenized on whitespace, and we have documents such as:

  • "THE PINK PANTHER"
  • "THE PUNK BAND"

And if we create a regexp query P.NK, we'd like to aggregate PINK and PUNK only (without the THE token rising to the top). Is there a way to achieve this?

Thanks!

Hi Pyppe,
I'm not sure what you mean by "matched" tokens. At first sight, I don't see a way to get all the tokens that matched in your text field (as in an inverted index query) and simultaneously aggregate on them (aggregating implies that you have a keyword field or fielddata).
Fielddata would allow you to aggregate on single tokens, but then the problem is how to filter which tokens matched and keep only the corresponding buckets.
On the other hand (regexp approach), you could use a runtime field to extract the tokens that match your regexp into a keyword field and run a terms aggregation on that field, which is pretty straightforward.

By matched token I mean: if "THE PINK PANTHER" has tokens [THE, PINK, PANTHER] and the query is P.NK, the matched token would be PINK. That is, when using a whitespace tokenizer.
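
To make the expected result concrete, here is a minimal Python sketch of the behaviour I'm after. It runs outside Elasticsearch and only simulates it: the function name `aggregate_matched_tokens` is made up for illustration, `str.split()` stands in for the whitespace tokenizer, and each token is counted once per document, the way a terms aggregation counts documents.

```python
import re
from collections import Counter


def aggregate_matched_tokens(docs, pattern):
    """Count, per document, the whitespace tokens that fully match `pattern`."""
    regex = re.compile(pattern)
    counts = Counter()
    for text in docs:
        # dict.fromkeys() de-duplicates tokens within one document while
        # keeping their order, so a token is counted once per document.
        for token in dict.fromkeys(text.split()):
            if regex.fullmatch(token):
                counts[token] += 1
    return counts


docs = ["THE PINK PANTHER", "THE PUNK BAND"]
buckets = aggregate_matched_tokens(docs, r"P.NK")
print(buckets.most_common())
```

Only PINK and PUNK get a bucket; THE, PANTHER and BAND never show up because they don't match the pattern.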

"On the other hand (regexp approach) you could use a runtime field to extract the tokens that match your regexp in a keyword field, and run a terms aggregation on this field, which is pretty straightforward."

I've never done anything like this. Any tips on relevant documentation for using a runtime field and running a terms aggregation on it?

First, you can read this blog post as an introduction: "Getting started with runtime fields, Elastic’s implementation of schema on read" (Elastic Blog). Especially the first part, where a runtime field is declared inside a _search request. If the runtime field is of type keyword, then it can be used in a terms aggregation, like any keyword field from your mapping.

In the following example, I have an authors field, possibly multi-valued, which can contain values like
"THE PINK PANTHER"
"PUNK BAND"
"STEVE PINK"
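I have added "fielddata": true to the mapping for this text field. This lets the script read the field's analyzed tokens through doc['authors'] instead of parsing the _source. Keep in mind that fielddata is loaded into heap memory, and that the script sees the tokens as produced by the analyzer (lowercased by the default analyzer, hence the lowercase regex below). You can also use data from the _source (for ref: "Map a runtime field", Elasticsearch Guide [8.2]).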
The following query should do (more or less) what you want:

GET myindex/_search
{
  "runtime_mappings": {
    "toto": {
      "type": "keyword",
      "script": {
        "source": """
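        // With fielddata enabled, doc['authors'] holds the analyzed tokens.
        List authors = doc['authors'];
        for (String s : authors) {
          def m = /p.nk/.matcher(s);
          // Emit the part of the token that matches the pattern.
          if (m.find()) {
            emit(m.group());
          }
        }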
        """
      }
    }
  },
  "aggs": {
    "nom": {
      "terms": {
        "field": "toto",
        "size": 10
      }
    }
  }
}
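
One caveat about the script above: m.find() looks for the pattern anywhere inside a token, so a longer token that merely contains a match would also produce a bucket. The Python sketch below illustrates the difference outside Elasticsearch; the token "spunky" is a made-up example, and `re.search` / `re.fullmatch` stand in for the Java-style `find()` / `matches()` calls. If only whole tokens should count, switching the script to m.matches() is probably what you want, though I have not tested that variant on this index.

```python
import re

# Hypothetical analyzed tokens; "spunky" is invented to show the difference.
tokens = ["pink", "punk", "spunky", "panther"]
pattern = re.compile(r"p.nk")


def emitted_with_find(tokens):
    """Mimic find() + group(): emit the matching substring of each token."""
    emitted = []
    for token in tokens:
        match = pattern.search(token)
        if match:
            emitted.append(match.group())
    return emitted


def emitted_with_matches(tokens):
    """Mimic matches(): emit a token only if the whole token matches."""
    return [token for token in tokens if pattern.fullmatch(token)]


print(emitted_with_find(tokens))     # "spunky" contributes an extra "punk"
print(emitted_with_matches(tokens))  # only the tokens "pink" and "punk"
```

With substring matching, the "punk" bucket would be inflated by documents that only contain "spunky".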
