Mapping to analyze a list with a fixed number of terms

In relation to this question: Top terms in comma separated list

I am wondering if there is a way to skip the scripted fields and have that information natively in the index.

To give a bit more context about the document: it has a foo attribute which is currently of type string. It represents a user's choices for a given service (there is a limited number of choices, say around 80 for the sake of the discussion, and that list changes from time to time).

We're happy to reindex, change the data structure or do whatever is necessary to get native support for this (that's our number one use case). We need to know which combinations of terms were chosen for foo (the order doesn't matter, and an empty list is also a valid combination).

To illustrate the problem a bit more, this query

GET /index-foo/_search
{
  "aggs": {
    "foos": {
      "terms": { "field": "foo" }
    }
  }
}

returns the presence of each individual term (ignoring the fact that foo contains a list of values). This information is interesting and we'd like to keep it, but we'd also like a way to get the distinct lists of values.

Thanks!

If you want to do the array concatenation as outlined in the other topic, it should be pretty easy to do using an Ingest node. There's a join processor built in that should do that for you. This will also keep the records enriched as you index them, with no additional tooling (i.e. Logstash).
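
Something along these lines should do it. This is just a sketch, and the pipeline name and the separator are made up:

PUT _ingest/pipeline/join-foo
{
  "description": "Join the foo array into a single comma-separated string",
  "processors": [
    {
      "join": {
        "field": "foo",
        "separator": ","
      }
    }
  ]
}

You'd then add ?pipeline=join-foo to your index requests (or configure it as a default pipeline, if your version supports that). Note that as written it replaces the foo array with the joined string, so if you want to keep both you'd write the joined value to a separate field instead.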

If I recall correctly, you can reindex against an Ingest node too, but I may be wrong about that. I'm not really versed in the reindex API.
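
If it does work the way I remember, it would look something like this, with the index names here just being placeholders:

POST _reindex
{
  "source": {
    "index": "index-foo"
  },
  "dest": {
    "index": "index-foo-v2",
    "pipeline": "join-foo"
  }
}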

Thanks for the reply!

So the recommendation is to duplicate the data then? One field ingested as a JSON array and one ingested as a single string with all the terms? Since we ingest that attribute as a JSON array, I thought both representations would be available somehow.
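
Just to make sure I understand, would the mapping end up looking something like this? (foo_combination is just a name I made up, and the exact mapping syntax probably needs adjusting for our version.)

PUT /index-foo
{
  "mappings": {
    "properties": {
      "foo": { "type": "keyword" },
      "foo_combination": { "type": "keyword" }
    }
  }
}

In other words, the array wouldn't show up in the mapping at all: foo stays a keyword field that happens to hold several values per document, and foo_combination is a keyword field holding the single joined string.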

Yeah, duplicating data in JSON datastores is pretty common. As CJ pointed out, you can avoid that by using scripted fields, but I assume there's some added overhead since that field needs to be calculated with each query (though I also assume it's cached...).
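
If you do go the duplication route, one way to handle it at ingest time is a variation of the pipeline above: copy the array into a second field with a script processor and join only the copy, so the original array stays available for the per-term aggregation. This is just a sketch that reuses the made-up foo_combination name and assumes foo is always present (even if it's an empty array):

PUT _ingest/pipeline/join-foo
{
  "description": "Keep foo as an array and add foo_combination as a single joined string",
  "processors": [
    {
      "script": {
        "source": "ctx.foo_combination = ctx.foo"
      }
    },
    {
      "join": {
        "field": "foo_combination",
        "separator": ","
      }
    }
  ]
}

Since the order doesn't matter for you, it may also be worth sorting the values before the join, so that the same set of terms always produces the same joined string.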

If the field doesn't change, then duplicating it isn't a big deal. If it's something that DOES change, keeping it in sync is probably going to be a real pain, and you should stick with scripted fields.
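
Once the joined field is there (and mapped as a keyword / not_analyzed field, so the whole string counts as a single term), the distinct combinations should come out of an ordinary terms aggregation on it. Again, foo_combination is just the placeholder name from above:

GET /index-foo/_search
{
  "size": 0,
  "aggs": {
    "foo_combinations": {
      "terms": { "field": "foo_combination" }
    }
  }
}

Depending on how many distinct combinations actually occur in the data, you may need to raise the size parameter of the terms aggregation, since it only returns the top buckets by default.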
