Match entities where a field with a specific value does not exist

Hello there,

I have my documents indexed in the following format:

{
  ...
  "special_id": 1,
  "phase": "create"
},
{
  ...
  "special_id": 1,
  "phase": "update"
},
{
  ...
  "special_id": 2,
  "phase": "update"
},
{
  ...
  "special_id": 3,
  "phase": "create"
},
{
  ...
  "special_id": 3,
  "phase": "update"
}

I would like to obtain a list of special_ids for which there are no documents with "phase" = "create". Is there a way to obtain such a list without external scripts?
Thanks!
Aba

What I've tried so far:

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "All_IDs": {
      "terms": {
        "field": "special_id",
        "size": 10
      },
      "aggs": {
        "phasetype": {
          "terms": {
            "field": "phase",
            "size": 10
          }
        }
      }
    }
  }
}

What I don't know how to do is extract the values where "doc_count" = 0 for "key" = "create". I think it would be enough if I could sort by "All_IDs.phasetype.doc_count", but I have no idea how.
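For reference, a terms aggregation can be ordered by the document count of a single-bucket sub-aggregation, which is close to the "sort by sub-aggregation count" idea. A sketch of that (the create_docs name is my own, and I am not certain the >_count order path behaves the same in every version), which would at least list the IDs with zero "create" documents first:

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "All_IDs": {
      "terms": {
        "field": "special_id",
        "size": 10,
        "order": { "create_docs>_count": "asc" }
      },
      "aggs": {
        "create_docs": {
          "filter": { "term": { "phase": "create" } }
        }
      }
    }
  }
}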

I made some progress in the meantime:

POST my_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "special_id": ["423580","423581","423582"]
          }
        }
      ]
    }
  },
  "aggs": {
    "All_IDs": {
      "terms": {
        "field": "special_id",
        "size": 10000
      },
      "aggs": {
        "filter_thing": {
          "filters": {
            "filters": {
              "phase_create": {"match": {"phase": "create"}},
              "phase_update": {"match": {"phase": "update"}}
            }
          },
          "aggs": {
            "my_bucket_selector": {
              "bucket_selector": {
                "buckets_path": {
                  "phaseCreate": "_count"
                },
                "script": "params.phaseCreate == 0"
              }
            }
          }
        }
      }
    }
  }
}

There is still a problem with this approach. I want to have only those values in the result where doc_count = 0, but this one still prints everything. See the result below:

"aggregations" : {
    "All_IDs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "423580",
          "doc_count" : 16,
          "filter_thing" : {
            "buckets" : {
              "phase_create" : {
                "doc_count" : 0
              }
            }
          }
        },
        {
          "key" : "423582",
          "doc_count" : 16,
          "filter_thing" : {
            "buckets" : { }
          }
        },
        {
          "key" : "423581",
          "doc_count" : 6,
          "filter_thing" : {
            "buckets" : {
              "phase_create" : {
                "doc_count" : 0
              }
            }
          }
        }
      ]
    }
  }

For this specific set of special_ids, 423582 should not appear. How can I achieve this?

Here is a minimal reproducible example. It filters out special_id buckets based on the count of documents matching the "create" phase, specifically when that count is 0:

# Create the index
PUT test_index
{
  "mappings": {
    "properties": {
      "special_id": {
        "type": "long"
      },
      "phase": {
        "type": "keyword"
      }
    }
  }
}

# Index some test data
POST test_index/_bulk?refresh=true
{"index":{}}
{"special_id":1,"phase":"create"}
{"index":{}}
{"special_id":1,"phase":"update"}
{"index":{}}
{"special_id":2,"phase":"update"}
{"index":{}}
{"special_id":3,"phase":"create"}
{"index":{}}
{"special_id":3,"phase":"update"}

# Search it
POST test_index/_search
{
  "size": 0,
  "aggs": {
    "All_IDs": {
      "terms": {
        "field": "special_id",
        "include": [
          1,
          2
        ],
        "size": 10000
      },
      "aggs": {
        "filter_thing": {
          "filters": {
            "filters": {
              "phase_create": {
                "match": {
                  "phase": "create"
                }
              },
              "phase_update": {
                "match": {
                  "phase": "update"
                }
              }
            }
          },
          "aggs": {
            "count": {
              "value_count": {
                "field": "special_id"
              }
            }
          }
        },
        "select": {
          "bucket_selector": {
            "buckets_path": {
              "c": "filter_thing['phase_create']>count"
            },
            "script": "params.c==0"
          }
        }
      }
    }
  }
}

Please note:

  • The terms aggregation supports the include parameter. This is much more performant than adding your own top-level filter if you are only filtering on those term values.
  • bucket_selector is a sub-aggregation of All_IDs and filters based on the count sub-aggregation within the phase_create bucket.

Honestly, it seems to me that the _count path should work with the filters aggregation. I am not 100% sure why it doesn't, but that smells like a bug to me.
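An untested workaround sketch for that: replace the multi-bucket filters aggregation with a single-bucket filter aggregation, where a _count-style buckets_path should resolve (I am not certain of the exact path form, so treat this as an assumption to verify):

POST test_index/_search
{
  "size": 0,
  "aggs": {
    "All_IDs": {
      "terms": { "field": "special_id", "size": 10000 },
      "aggs": {
        "phase_create": {
          "filter": { "term": { "phase": "create" } }
        },
        "select": {
          "bucket_selector": {
            "buckets_path": { "c": "phase_create._count" },
            "script": "params.c == 0"
          }
        }
      }
    }
  }
}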

In the meantime I also came up with a very complicated solution which seems to work, but yours looks better. Here is mine:

POST my_index/_search
{
  "size": 0,
  "aggs": {
    "All_IDs": {
      "terms": {
        "field": "special_id",
        "size": 20000
      },
      "aggs": {
        "filter_thing": {
          "filters": {
            "filters": {
              "phase_create": {"match": {"phase": "create"}}
            }
          },
          "aggs": {
            "my_bucket_selector": {
              "bucket_selector": {
                "buckets_path": {
                  "phaseCreate": "_count"
                },
                "script": "params.phaseCreate == 0"
              }
            }
          }
        },
        "min_bucket_count": {
          "bucket_selector": {
            "buckets_path": {
              "count": "filter_thing._bucket_count"
            },
            "script": "params.count != 0"
          }
        }
      }
    }
  }
}

I removed the "phase_update" filter so I can use _count, and I added a second bucket_selector to filter out the empty buckets.

Anyhow, your solution looks better. Thank you very much for all your help!

Now the only problem which remains to be solved is to run this _search on all the IDs (there are approximately 500,000) and all the documents (approximately 10 million). I think I should split it somehow by special_id intervals. The only issue is that special_id is really special: it is a keyword :frowning: .
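One built-in way to split a keyword terms aggregation into chunks is terms partitioning: the include parameter also accepts a partition object, and each request then processes one deterministic slice of the terms. A sketch (the choice of 50 partitions is arbitrary; loop partition from 0 to 49):

POST my_index/_search
{
  "size": 0,
  "aggs": {
    "All_IDs": {
      "terms": {
        "field": "special_id",
        "include": { "partition": 0, "num_partitions": 50 },
        "size": 10000
      }
      # keep the same filter_thing / select sub-aggregations as above
    }
  }
}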

Any system that tries to get a holistic view of each entity is going to struggle if the related data for each entity is distributed somewhat randomly across multiple machines.

The terms aggregation is useful for grouping together related items under a common key but will fail to do so reliably if there are very large numbers of keys spread across multiple machines. Not all data nodes may agree exactly on which keys to gather for final analysis on the search coordinating node, meaning some data may be missing for keys presented in the final result.

Your sort of behavioural analysis looks like a classic use case for an entity-centric index created using the transforms API.
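A minimal sketch of such a transform, with assumed names (phase_summary, entities_index) and assuming the filter aggregation is supported in pivots on your version, which is worth verifying:

PUT _transform/phase_summary
{
  "source": { "index": "my_index" },
  "dest": { "index": "entities_index" },
  "pivot": {
    "group_by": {
      "special_id": { "terms": { "field": "special_id" } }
    },
    "aggregations": {
      "create_docs": {
        "filter": { "term": { "phase": "create" } }
      }
    }
  }
}

After starting it with POST _transform/phase_summary/_start, the per-entity documents in entities_index can be searched directly for IDs whose "create" count is zero.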

My data is not yet distributed across multiple machines; this index has only 1 shard.

GET _cat/shards/my_index?v

index    shard prirep state   docs      store  ip              node
my_index 0     p      STARTED 450548050 37.9gb 123.123.123.123 test-node-3

Sidenote: The doc count is 450548050, however the number of documents I am interested in is actually smaller. (I am also filtering based on another field.)
See below result for the search call written above in the earlier posts:

{
  "took" : 431,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "All_IDs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 9846594,
      "buckets" : [ ]
    }
  }
}

But I understand what you are saying: the "relation" : "gte" is clearly not ok for me, and the "sum_other_doc_count" : 9846594 is also way too high. I will review the transforms API as you suggested, and I hope I can figure something out. Any other tips that can help me along the way are very much appreciated. Thank you for your help!

What you may find is that by transforming low-level log records into entities you can do a lot more interesting analysis e.g. average length of time taken for different phases, max number of phase changes per entity etc.
Derived entity attributes like "duration" or "lastKnownState" are typically useful things the transform API can help with. It shifts the computational cost and complexity of doing this sort of analysis from query-time to index-time.
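For example, a transform pivot could include min/max aggregations on a timestamp field (assuming the documents carry one; @timestamp here is hypothetical) to derive first/last-seen attributes per entity:

"aggregations": {
  "first_seen": { "min": { "field": "@timestamp" } },
  "last_seen": { "max": { "field": "@timestamp" } }
}

A "duration" could then be computed from those two values at query time.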

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.