Is there any way to aggregate only the "matched" tokens for a particular field? Let's say we have a field that is tokenized on whitespace, and we have documents such as:
"THE PINK PANTHER"
"THE PUNK BAND"
If we run the regexp query P.NK, we'd like to aggregate only the PINK and PUNK tokens (without the THE token rising to the top). Is there a way to achieve this?
Hi Pyppe,
I'm not sure what you mean by "matched" tokens. At first sight, I don't see a way to get all the tokens that matched in your text field (as in an inverted index query) and simultaneously aggregate on them (aggregating implies that you have a keyword field or fielddata).
Fielddata would allow you to aggregate on single tokens, but then the problem is how to filter on which tokens matched and keep only the corresponding buckets.
On the other hand (regexp approach) you could use a runtime field to extract the tokens that match your regexp in a keyword field, and run a terms aggregation on this field, which is pretty straightforward.
By matched token I mean that if "THE PINK PANTHER" has the tokens [THE, PINK, PANTHER] and the query is P.NK, the matched token would be PINK. That is, when using a whitespace tokenizer.
"On the other hand (regexp approach) you could use a runtime field to extract the tokens that match your regexp in a keyword field, and run a terms aggregation on this field, which is pretty straightforward."
Haven't ever done something like this. Any tips for relevant documentation about using this runtime field, and running terms aggregation on it?
In the following example, I have an authors field, possibly multi-valued, which can contain values like
"THE PINK PANTHER"
"PUNK BAND"
"STEVE PINK"
I have added "fielddata": true in the mapping for this text field, as an optimization. It lets the script read the field's indexed tokens through doc['authors'] (at the cost of some heap memory) instead of parsing _source, but you can also use data from the _source (for ref: Map a runtime field | Elasticsearch Guide [8.2] | Elastic).
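For reference, a mapping along these lines would enable fielddata on the field. This is only a sketch: the index name myindex and the field name authors are taken from the query in this answer, and it assumes the default standard analyzer, which lowercases tokens (that is why a lowercase pattern such as p.nk can match "PINK"). With a whitespace analyzer the tokens keep their case, so the pattern would have to match uppercase instead.

```
PUT myindex
{
  "mappings": {
    "properties": {
      "authors": {
        "type": "text",
        "fielddata": true
      }
    }
  }
}
```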
The following query should do (more or less) what you want:
GET myindex/_search
{
  "runtime_mappings": {
    "toto": {
      "type": "keyword",
      "script": {
        "source": """
          // With fielddata enabled, doc['authors'] holds the field's indexed tokens
          List authors = doc['authors'];
          for (String s : authors) {
            def m = /p.nk/.matcher(s);
            // Emit only the part of the token that matches the pattern
            if (m.find()) {
              emit(m.group());
            }
          }
        """
      }
    }
  },
  "aggs": {
    "nom": {
      "terms": {
        "field": "toto",
        "size": 10
      }
    }
  }
}
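If the index is large, it may also help to add a regexp query so that the runtime script only runs on documents containing a matching token. The request below is a sketch under the same assumptions as above (index myindex, field authors with fielddata enabled, lowercase tokens from the default analyzer); the names toto and nom are just the placeholders already used in this thread.

```
GET myindex/_search
{
  "size": 0,
  "query": {
    "regexp": {
      "authors": { "value": "p.nk" }
    }
  },
  "runtime_mappings": {
    "toto": {
      "type": "keyword",
      "script": {
        "source": """
          for (String s : doc['authors']) {
            def m = /p.nk/.matcher(s);
            if (m.find()) {
              emit(m.group());
            }
          }
        """
      }
    }
  },
  "aggs": {
    "nom": {
      "terms": { "field": "toto", "size": 10 }
    }
  }
}
```

Here "size": 0 drops the hits so that only the buckets come back. Note that the regexp query has to match a whole token, while m.find() also matches inside longer tokens, so the two can disagree on edge cases.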