Discrepancy in using Top Hits with Query

Note - I moved all the results to Pastebin, due to the word count limit of this site, sorry for the inconvenience.

When I am running the following query, for grouping by creator_cust_id in a specific date range:

{  
   "from":0,
   "size":0,
   "query":{  
      "bool":{  
         "must":[  
            {  
               "range":{  
                  "lead_created_time":{  
                     "from":"2019-02-01 00:00:00",
                     "to":"2019-03-01 00:00:00",
                     "format":"yyyy-MM-dd HH:mm:ss",
                     "boost":1.0
                  }
               }
            }
         ],
         "adjust_pure_negative":true,
         "boost":1.0
      }
   },
   "aggregations":{  
      "creator_cust_id":{  
         "terms":{  
            "field":"creator_cust_id.keyword",
            "size":1000,
            "min_doc_count":1,
            "shard_min_doc_count":0,
            "show_term_doc_count_error":false,
            "order":[  
               {  
                  "_count":"desc"
               },
               {  
                  "_key":"asc"
               }
            ]
         },
         "aggregations":{  
            "data_fields":{  
               "top_hits":{  
                  "from":0,
                  "size":1,
                  "version":false,
                  "explain":false,
                  "_source":{  
                     "includes":[  
                        "creator.agent_name",
                        "creator.agent_team"
                     ],
                     "excludes":[  

                     ]
                  }
               }
            }
         }
      }
   }
}

I get the proper result with all the desired data (agent_name, agent_team), which is: https://pastebin.com/UgFeeK1a

But, when I add a query to get the specific creator_cust_id, I get proper data in one case (1000013335) and don't get the data in another case (1107221321).

Here are the respective queries with the result:

Getting desired data

{
  "from": 0,
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "lead_created_time": {
              "from": "2019-02-01 00:00:00",
              "to": "2019-03-01 00:00:00",
              "format": "yyyy-MM-dd HH:mm:ss",
              "boost": 1
            }
          }
        },
        {
          "match": {
            "creator_cust_id": {
              "query": "1000013335",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "aggregations": {
    "creator_cust_id": {
      "terms": {
        "field": "creator_cust_id.keyword",
        "size": 1000,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": [
          {
            "_count": "desc"
          },
          {
            "_key": "asc"
          }
        ]
      },
      "aggregations": {
        "data_fields": {
          "top_hits": {
            "from": 0,
            "size": 1,
            "version": false,
            "explain": false,
            "_source": {
              "includes": [
                "creator.agent_name",
                "creator.agent_team"
              ],
              "excludes": []
            }
          }
        }
      }
    }
  }
}

Result : https://pastebin.com/AwJxLb91

Not getting desired data

{
  "from": 0,
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "lead_created_time": {
              "from": "2019-02-01 00:00:00",
              "to": "2019-03-01 00:00:00",
              "format": "yyyy-MM-dd HH:mm:ss",
              "boost": 1
            }
          }
        },
        {
          "match": {
            "creator_cust_id": {
              "query": "1107221321",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "aggregations": {
    "creator_cust_id": {
      "terms": {
        "field": "creator_cust_id.keyword",
        "size": 1000,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": [
          {
            "_count": "desc"
          },
          {
            "_key": "asc"
          }
        ]
      },
      "aggregations": {
        "data_fields": {
          "top_hits": {
            "from": 0,
            "size": 1,
            "version": false,
            "explain": false,
            "_source": {
              "includes": [
                "creator.agent_name",
                "creator.agent_team"
              ],
              "excludes": []
            }
          }
        }
      }
    }
  }
}

Result : https://pastebin.com/Pp9sMTy3

Notice the empty source field here.

I am not able to understand, why I am getting such a discrepancy, as I have used the query in a lot of aggregations, without any error till now.

I tried it by using Filter Aggregation and it worked properly, but I can't use Filter Aggregation in my case. What's the logic behind this difference? How can I solve it using query?

I know it won't make any difference, but I am using Java High-Level Rest Client, and these queries are made using that only.

Do ask in the comments for any information that is missing, or need explanation.

I think it would be interesting to see what is the output of those queries without the aggregations. They might give light to what you are seeing.

Thanks for your reply, I will check them out and update with the same!

Here are the queries and respective result:

Query for 1000013335 : https://pastebin.com/GTLmySZx
Result : https://pastebin.com/bzDsQTim

Query for 1107221321 : https://pastebin.com/ghdA0nuJ
Result : https://pastebin.com/bzhSPMya

As you see in the result of your second query, the first element does not contain a creator, maybe it is missing in your data. The top hits aggregator is taking the value for that element, therefore it is empty.

Then why does the data show up, when I am using the general aggregation (without any creator_cust_id query)?

Your query is filtering by range:

"range": {
            "lead_created_time": {
              "from": "2019-02-01 00:00:00",
              "to": "2019-03-01 00:00:00",
              "format": "yyyy-MM-dd HH:mm:ss",
              "boost": 1
            }
          }

and by creator_cust_id:

 "match": {
            "creator_cust_id": {
              "query": "1107221321",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1
            }
          }

But it is not filtering out elements that do not have:

creator.agent_name
creator.agent_team

So the document matches the query but it contains an empty creator object. Then the top level aggregation uses creator_cust_id.keyword and then you have the top hits aggregator. The result is what it is expected, there is no place where you exclude documents with empty creator object.

Yeah I got the answer I was looking for, thanks for your help. Wanted to ask one thing, is there a way to get results where agent_name and agent_team are non empty? One way I think of is with using a mixture of aggregation of creator_cust_id + agent_name. Is there a better way to do it?

I guess you can you the exists query to filter out those documents:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html

If you want to be part of your query but filter them from the aggregation you might be able to do it with the filter aggregation:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filters-aggregation.html

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.