Using post_filter to remove entries where the collapsed inner hits total is less than X

I have a query that collapses on a field holding a hash that can be shared by at most two entries. Using a post filter (or an alternative), I need to remove results from the final list where the inner hits total is 1 rather than 2. However, the post filter cannot find the inner hits for each entry, so the total is not available to it.

What would be the best way to filter out results that have only one entry within the collapse inner hits?

Elasticsearch version: 7.10

Example query below...

{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              {
                "exists": {
                  "field": "someHash"
                }
              }
            ]
          }
        },
        {
          // ... other must items ... //
        }
      ],
      "filter": [
        // ... filter items ... //
      ],
      "must_not": [
        // ... must not items ... //
      ]
    }
  },
  "collapse": {
    "field": "someHash",
    "inner_hits": {
      "name": "same_hash",
      "size": 2
    }
  },
  "aggs": {
    "unique_count": {
      "cardinality": {
        "field": "someHash",
        "precision_threshold": 10000
      }
    }
  },
  "sort": [
    {
      "entityId": "desc"
    }
  ],
  "post_filter": {
    "bool": {
      "filter": [
        {
          "term": {
            "inner_hits.same_hash.hits.total": {
              "value": 2
            }
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 100
}

Hi @codeMonkey82,

Thanks for raising your issue. It looks like it might be similar to the below issue:

Can you give the nested/inner_hits option a try and let us know if that resolves your issue?

@carly.richmond Unfortunately I did try this but it did not work...

I tried the following query...

{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              {
                "exists": {
                  "field": "someHash"
                }
              }
            ]
          }
        },
        {
          // ... other must items ... //
        }
      ],
      "filter": [
        // ... filter items ... //
      ],
      "must_not": [
        // ... must not items ... //
      ]
    }
  },
  "collapse": {
    "field": "someHash",
    "inner_hits": {
      "name": "same_hash",
      "size": 2
    }
  },
  "aggs": {
    "unique_count": {
      "cardinality": {
        "field": "someHash",
        "precision_threshold": 10000
      }
    }
  },
  "sort": [
    {
      "entityId": "desc"
    }
  ],
  "post_filter": {
    "nested": {
      "path": "same_hash",
      "inner_hits": {},
      "query": {
        "term": {
          "hits.total": {
            "value": 2
          }
        }
      }
    }
  },
  "from": 0,
  "size": 100
}

The error response was...

failed to create query: [nested] failed to find nested object under path [same_hash]

FYI, it is not the contents of the inner hits I am interested in but the total, which would allow me to filter out any parent that has fewer than two inner hits.

Thanks for confirming. Can you share the mapping for your index please? Also which version of Elasticsearch are you using?

I pulled one of the mappings but had to strip out certain elements for security reasons. Fundamentally, the content is structured the same way. I am using Elasticsearch v7.10.

{
  "index_2024_06": {
    "mappings": {
      "dynamic": "false",
      "properties": {
        "hasTask": {
          "type": "boolean"
        },
        "someHash": {
          "type": "keyword"
        },
        "entityId": {
          "type": "keyword"
        }
      }
    }
  }
}

@carly.richmond any joy on this one? We really need a solution that works, or we might need to refactor our approach around the use of collapse.

Sorry I've not had time to have a play yet to figure it out. It looks like the post_filter approach in the referenced post is available from 7.10.

I see the type of someHash is keyword rather than nested which surprised me slightly. Can you give me an example of the data structure (sanitized obviously!)?

Nested is only used for arrays of objects, to allow them to be searched independently. What I am trying to do is query the inner_hits total value, which is produced as part of the collapse. If I were able to query this total (inner_hits.same_hash.hits.total.value) then I would be able to filter out what I need, but I am unable to do this. It seems that inner_hits is not available to the post_filter in this way, so I need a solution to the above.

I suppose the first question is: does the post filter run against the results after the collapse?
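To illustrate where that total lives: as far as I can tell, it only exists in the search response, attached to each collapsed hit, and is not a field in the index mapping. That would be consistent with post_filter not being able to match on it. Below is a trimmed sketch of what one collapsed hit looks like in a 7.x response. The document IDs, hash value and sort value are made up for illustration, and most response fields are left out.

```json
{
  "hits": {
    "hits": [
      {
        "_index": "index_2024_06",
        "_id": "doc-1",
        "_score": null,
        "fields": {
          "someHash": ["abc123"]
        },
        "sort": ["1002"],
        "inner_hits": {
          "same_hash": {
            "hits": {
              "total": { "value": 2, "relation": "eq" },
              "hits": [
                { "_id": "doc-1" },
                { "_id": "doc-2" }
              ]
            }
          }
        }
      }
    ]
  }
}
```

The path inner_hits.same_hash.hits.total.value is therefore a path into the response JSON for each hit, not a path into the indexed document.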

Thanks for confirming @codeMonkey82. I've not been able to get the post_filter to work either. But I was wondering if you had considered using a terms aggregation combined with a bucket_selector to filter on the counts instead of using collapse?

The solution needs to take pagination and sorting of the final result set into account, so using buckets might not work as expected. I do appreciate the response though, so thank you.
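A minimal sketch of what that could look like, assuming the someHash and entityId keyword fields from the mapping above. The aggregation names by_hash and only_pairs are hypothetical, the terms size of 10000 is an arbitrary assumption, and the other must/filter/must_not clauses from the original query would still need to be added to the query section.

```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "someHash" } }
      ]
    }
  },
  "aggs": {
    "by_hash": {
      "terms": {
        "field": "someHash",
        "size": 10000
      },
      "aggs": {
        "only_pairs": {
          "bucket_selector": {
            "buckets_path": { "docCount": "_count" },
            "script": "params.docCount == 2"
          }
        },
        "same_hash": {
          "top_hits": {
            "size": 2,
            "sort": [
              { "entityId": "desc" }
            ]
          }
        }
      }
    }
  }
}
```

Each remaining bucket is a hash with exactly two matching documents, and the top_hits sub-aggregation returns those two documents in place of the collapse inner hits. Since a hash can be shared by at most two entries, setting "min_doc_count": 2 on the terms aggregation should give the same buckets without the script. Note that the counts only reflect documents matching the query, and that this sketch does not paginate the buckets or order them by entityId.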