Elasticsearch query on a nested field with condition

Elasticsearch v7.0

Hello and good day!

I'm trying to create a query that will have a condition: if a nested field has only 1 element, get that first element, if a nested field has 2 more or elements, get a matching nested field condition

Scenario:

I have an index named socialmedia and has a nested field named cms which places a sentiment for that document

An example document of the cms field looks like this

"_id" : 1,
"cms" : [
    {
      "cli_id" : 0,
      "cmx_sentiment" : "Negative"
    }
]

This cms field contains "cli_id" : 0 by default for its 1st element (this means it is for all the clients/users to see) but sooner or later, it goes like this:

"_id": 1,
"cms" : [
    {
      "cli_id" : 0,
      "cmx_sentiment" : "Negative"
    },
    {
      "cli_id" : 1,
      "cmx_sentiment" : "Positive"
    },
    {
      "cli_id" : 2,
      "cmx_sentiment" : "Neutral"
    },
]

The 2nd and 3rd element shows that the clients with cli_id equals to 1 and 2 has made a sentiment for that document.

Now, I want to formulate a query that if the client who logged in has no sentiment yet for a specific document, it fetches the cmx_sentiment that has the "cli_id" : 0

BUT , if the client who has logged in has a sentiment for the fetched documents according to his filters, the query will fetch the cmx_sentiment that has the matching cli_id of the logged in client

for example:
the client who has a cli_id of 2, will get the cmx_sentiment of **Neutral** according to the given document above

the client who has a cli_id of 5, will get the cmx_sentiment of **Negative** because he hasn't given a sentiment to the document

PSEUDO CODE :

If a document has a sentiment indicated by the client, get the cmx_sentiment of the cli_id == to the client's ID

if a document is fresh or the client HAS NOT labeled yet a sentiment on that document, get the element's cmx_sentiment that has cli_id == 0

I'm in need of a query to condition for the pseudo code above

Here's my sample query:

"aggs" => [
    "CMS" => [
        "nested" => [
            "path" => "cms",
        ],
        "aggs" => [
            "FILTER" => [
                "filter" => [
                    "bool" => [
                        "should" => [
                            [
                                "match" => [
                                    "cms.cli_id" => 0
                                ]
                            ],
                            [
                                "bool" => [
                                    "must" => [
                                        [
                                            // I'm planing to create a bool method here to test if cli_id is equalis to the logged-in client's ID
                                        ]
                                    ]
                                ]
                            ]
                        ]
                    ]
                ],
                "aggs"=> [
                    "TONALITY"=> [
                        "terms"=> [
                            "field" => "cms.cmx_sentiment"
                        ],
                    ]
                ]
            ]
        ]
    ]
]

Is my query correct?

The problem with the query I have provided, is that it SUMS all the elements, instead of picking one only

The query above provides this scenario:

The client with cli_id 2 logs in
Both the Neutral and Negative cmx_sentiment are being retrieved, instead of the Neutral alone

This use case is extremely close to the use case for script fields. The only problem is that script_fields might not be very efficient with nested datatypes.

I was able to write a scripted query that seems to do what you're looking for:

GET socialmedia/_search
{
  "script_fields": {
    "my_sentiment": {
      "script": {
        "source": """
          String default = 'Neutral';
          for (int i = 0; i < params['_source']['cms'].length; i++) {
            if (params['_source']['cms'][i]['cli_id'] == 0) {
              default = params['_source']['cms'][i]['cmx_sentiment'];
            }
            if (params['_source']['cms'][i]['cli_id'] == params['client_id']) {
              return params['_source']['cms'][i]['cmx_sentiment'];
            }
          }
          return default;
        """,
        "params": {
          "client_id": 2
        }
      }
    }
  }
}

If I were a little more familiar with the Painless scripting language, I probably could have made that look better. Anyway, the idea for script_fields is that you can do a little transformation on query results without going so deep into aggregations. And you can also parameterize these scripts, so you can plug in different client IDs.

The problem with what I've posted is that it might not perform well. As far as I can tell, the scripted query itself can't access the fields of nested documents because of the way the nested datatype is indexed into Lucene. So the script above takes the original document source and parses the JSON to get the data it needs. This might turn out to be manageable if you're only retrieving small numbers of documents, but it's hard to say.

Do you think this approach could work for your use case? There may be some other (or even better) approaches. This was just the first thing that occurred to me as I read your post.

-William

Hello sir WIlliam, thank you very much for your answer! gained a lot of insights about the "script" of ES.

However, is this applicable to aggs? 'Coz my aim is to create a report of doc_count to count the number of sentiments a client has for all the retrieved documents

It looks like I misunderstood your use case. How about this for a scripted solution, using a scripted metric aggregation for each sentiment?

First, we store a parameterized script for later use:

POST _scripts/client_sentiment
{
  "script": {
    "lang": "painless",
    "source": """
      String default = "";
      
      for (s in params['_source']['cms']) {
        
        // capture the "default" sentiment
        if (s['cli_id'] == 0) {
          default = s['cmx_sentiment'];
        }
        
        // if we find a sentment for our client, mark it here
        if (s['cli_id'] == params['client_id']) {
          if (s['cmx_sentiment'] == params['sentiment']) {
            state.counts.add(1);
          }
          return;
        }
      }
      
      // if our client hasn't shown up, check the default sentiment
      if (default == params['sentiment']) {
        state.counts.add(1);
      }
    """
  }
}

Now we can issue a query:

GET socialmedia/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "positive_sentiment": {
      "scripted_metric": {
        "init_script": "state.counts = []",
        "map_script": {
          "id": "client_sentiment"
        },
        "combine_script": "int total = 0; for (c in state.counts) { total += c} return total",
        "reduce_script": "int total = 0; for (a in states) { total += a } return total",
        "params": {
          "client_id": 2,
          "sentiment": "Positive"
        }
      }
    },
    "neutral_sentiment": {
      "scripted_metric": {
        "init_script": "state.counts = []",
        "map_script": {
          "id": "client_sentiment"
        },
        "combine_script": "int total = 0; for (c in state.counts) { total += c} return total",
        "reduce_script": "int total = 0; for (a in states) { total += a } return total",
        "params": {
          "client_id": 2,
          "sentiment": "Neutral"
        }
      }
    },
    "negative_sentiment": {
      "scripted_metric": {
        "init_script": "state.counts = []",
        "map_script": {
          "id": "client_sentiment"
        },
        "combine_script": "int total = 0; for (c in state.counts) { total += c} return total",
        "reduce_script": "int total = 0; for (a in states) { total += a } return total",
        "params": {
          "client_id": 2,
          "sentiment": "Negative"
        }
      }
    }
  }
}

My output includes the following:

{
  [...],
  "aggregations" : {
    "positive_sentiment" : {
      "value" : 0
    },
    "neutral_sentiment" : {
      "value" : 1
    },
    "negative_sentiment" : {
      "value" : 1
    }
  }
}

Is that the kind of report you're hoping to see? That is, for a given query, a count per sentiment for the given client?

Sorry for the confusion,

-William

I tried to find a script-less way of doing what you're looking for. It's not so hard to express the concept of "this client's ratings" for aggregation. The query only needs to find one nested document for that concept: that is, the client's rating. What's tricky is the concept of "the default ratings on documents that this client has not rated"; the query needs to access two nested documents. I haven't been able to figure out how to do the "join" within the same aggregation context.

You can get pretty close, however, with a separate aggregation for each concept:

GET socialmedia/_search
{
  "aggs": {
    "ratings_on_rated_docs": {
      "nested": { "path": "cms" },
      "aggs": {
        "client_ratings": {
          "filter": { "match": { "cms.cli_id": 1 } },
          "aggs": {
            "client_rating_terms": {
              "terms": {
                "field": "cms.cmx_sentiment.keyword"
              }
            }
          }
        }
      }
    },
    "docs_without_client_ratings": {
      "filter": {
        "bool": {
          "must_not": {
            "nested": {
              "path": "cms",
              "query": { "match": { "cms.cli_id": 1 } }
            }
          }
        }
      },
      "aggs": {
        "ratings_on_unrated_docs": {
          "nested": { "path": "cms" },
          "aggs": {
            "default_ratings": {
              "filter": { "match": { "cms.cli_id": 0 } },
              "aggs": {
                "default_rating_terms": {
                  "terms": {
                    "field": "cms.cmx_sentiment.keyword"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

I get this in my results:

{
  [...],
  "aggregations" : {
    "docs_without_client_ratings" : {
      "doc_count" : 1,
      "ratings_on_unrated_docs" : {
        "doc_count" : 1,
        "default_ratings" : {
          "doc_count" : 1,
          "default_rating_terms" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "Negative",
                "doc_count" : 1
              }
            ]
          }
        }
      }
    },
    "ratings_on_rated_docs" : {
      "doc_count" : 4,
      "client_ratings" : {
        "doc_count" : 1,
        "client_rating_terms" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "Positive",
              "doc_count" : 1
            }
          ]
        }
      }
    }
  }
}

There may be a way to get these counts in a single aggregation, but if so, I haven't been clever enough to find it. :frowning:

Some kind of script still seems to me like the natural way to get the query-time transformation you're looking for. I just worry a little about the performance.

-William

Woah! This is the best and closest ES query for me so far, and gets the right / exact output I'm looking for, I'm very grateful for this!

It's just that I want to perform the desired output in a single aggs because in this manner, I might have to sum up the 2 aggs called ratings_on_rated_docs and docs_without_client_ratings just to combine them... Hmmmm... still gonna dive into ES docs more and see if there would be a possible single aggs query

THANK YOU SO MUCH FOR THIS SIR @William_Brafford I FEEL THAT I'M ALREADY NEAR TO MY TARGET :smiley:

Just in case, I had a Stackoverflow inquiry about this: https://stackoverflow.com/questions/59574104/elasticsearch-query-on-a-nested-field-with-condition and had a discussion with another person: https://chat.stackoverflow.com/rooms/205405/discussion-between-suomynona-and-opster-es-ninja-nishant-and-suomynona

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.