Match entities where field with specific value does not exists

farkasaba · August 10, 2021, 9:32am

Hello there,

I have my documents indexed in the following format:

{
...
"special_id"=1,
"phase"="create"
},
{
...
"special_id"=1,
"phase"="update"
},
{
...
"special_id"=2,
"phase"="update"
},
{
...
"special_id"=3,
"phase"="create"
},
{
...
"special_id"=3,
"phase"="update"
}

I would like to obtain a list of special_id s, where there are no documents with "phase" = "create". Is there a way to obtain such list without external scripts?
Thanks!
Aba

farkasaba · August 10, 2021, 11:38am

What I've tried so far:

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "All_IDs": {
      "terms": {
        "field": "special_id",
        "size": 10
      },
      "aggs": {
        "phasetype": {
          "terms": {
            "field": "phase",
            "size": 10
          }
        }
      }
    }
  }
}

What I don't know how to do, is to extract the values, where "doc_count" = 0 for the "key" = "create". I think it would be enough if I could sort by "All_IDs.phasetype.doc_count", but I have no idea how.

farkasaba · August 10, 2021, 7:46pm

I made some progress in the meantime:

POST my_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "special_id": ["423580","423581","423582"]
          }
        }
      ]
    }
  },
  "aggs": {
    "All_IDs": {
      "terms": {
        "field": "special_id",
        "size": 10000
      },
      "aggs": {
        "filter_thing": {
          "filters": {
            "filters": {
              "phase_create": {"match": {"phase": "create"}},
              "phase_update": {"match": {"phase": "update"}}
            }
          },
          "aggs": {
            "my_bucket_selector": {
              "bucket_selector": {
                "buckets_path": {
                  "phaseCreate": "_count"
                },
                "script": "params.phaseCreate == 0"
              }
            }
          }
        }
      }
    }
  }
}

There is still a problem with this approach. I want to have only those values in the result, where the doc_count = 0, but this one still prints everything. See below result:

"aggregations" : {
    "All_IDs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "423580",
          "doc_count" : 16,
          "filter_thing" : {
            "buckets" : {
              "phase_create" : {
                "doc_count" : 0
              }
            }
          }
        },
        {
          "key" : "423582",
          "doc_count" : 16,
          "filter_thing" : {
            "buckets" : { }
          }
        },
        {
          "key" : "423581",
          "doc_count" : 6,
          "filter_thing" : {
            "buckets" : {
              "phase_create" : {
                "doc_count" : 0
              }
            }
          }
        }
      ]
    }
  }

For this specific set of special_id s, 423582 should not appear. How can I achieve this?

BenTrent · August 10, 2021, 8:09pm

Here is a minimal repeatable example. This will filter out special_id buckets based on the count of values that match the create phase. Specifically, when count is 0

# Create the index
PUT test_index
{
  "mappings": {
    "properties": {
      "special_id": {
        "type": "long"
      },
      "phase": {
        "type": "keyword"
      }
    }
  }
}

# Index some test data
POST test_index/_bulk?refresh=true
{"index":{}}
{"special_id":1,"phase":"create"}
{"index":{}}
{"special_id":1,"phase":"update"}
{"index":{}}
{"special_id":2,"phase":"update"}
{"index":{}}
{"special_id":3,"phase":"create"}
{"index":{}}
{"special_id":3,"phase":"update"}

# Search it
POST test_index/_search
{
  "size": 0,
  "aggs": {
    "All_IDs": {
      "terms": {
        "field": "special_id",
        "include": [
          1,
          2
        ],
        "size": 10000
      },
      "aggs": {
        "filter_thing": {
          "filters": {
            "filters": {
              "phase_create": {
                "match": {
                  "phase": "create"
                }
              },
              "phase_update": {
                "match": {
                  "phase": "update"
                }
              }
            }
          },
          "aggs": {
            "count": {
              "value_count": {
                "field": "special_id"
              }
            }
          }
        },
        "select": {
          "bucket_selector": {
            "buckets_path": {
              "c": "filter_thing['phase_create']>count"
            },
            "script": "params.c==0"
          }
        }
      }
    }
  }
}

Please note:

Terms aggs supports the include parameter. This is much more performant than adding your own top level filter if you are only filtering on those term values
bucket_selector is a sub agg for the ALL_IDs and is filtering based on the sub-agg within the phase_createaggregation

Honestly, it seems to me that the _count path should work with filters aggregations. I am not 100% sure why it doesn't but that smells like a bug to me.

farkasaba · August 10, 2021, 8:52pm

In the meantime I also came up with a very complicated solution, which seems to work, but yours is looking better. Here is mine:

POST my_index/_search
{
  "size": 0,
  "aggs": {
    "All_IDs": {
      "terms": {
        "field": "special_id",
        "size": 20000
      },
      "aggs": {
        "filter_thing": {
          "filters": {
            "filters": {
              "phase_create": {"match": {"phase": "create"}}
            }
          },
          "aggs": {
            "my_bucket_selector": {
              "bucket_selector": {
                "buckets_path": {
                  "phaseCreate": "_count"
                },
                "script": "params.phaseCreate == 0"
              }
            }
          }
        },
        "min_bucket_count": {
          "bucket_selector": {
            "buckets_path": {
              "count": "filter_thing._bucket_count"
            },
            "script": "params.count != 0"
          }
        }
      }
    }
  }
}

I removed the "phaseUpdate" filter, so I can use the _count and I added a second bucket_selector to filter out the empty buckets.

Anyhow, your solution looks better. Thank you very much for all your help!

Now the only problem which remains to be solved is to use this _search on all the IDs (there are approximately 500000), and all the documents (approx 10million documents). I think I should split it somehow by special_id intervals. The only issue is, that special_id is really special. It is keyword .

Mark_Harwood · August 11, 2021, 8:32am

Any system that tries to get a holistic view of each entity is going to struggle if the related data for each entity is distributed somewhat randomly across multiple machines.

The terms aggregation is useful for grouping together related items under a common key but will fail to do so reliably if there are very large numbers of keys spread across multiple machines. Not all data nodes may agree exactly on which keys to gather for final analysis on the search coordinating node, meaning some data may be missing for keys presented in the final result.

Your sort of behavioural analysis looks like a classic use case for an entity centric index created using the transforms api.

farkasaba · August 11, 2021, 9:05am

My data is not yet distributed across multiple machines. This is index has only 1 shard.

GET _cat/shards/restored_prod_event_store20210729?v

index                             shard prirep state        docs  store ip          node
my_index 0     p      STARTED 450548050 37.9gb 123.123.123.123 test-node-3

Sidenote: The doc count is 450548050, however the number of documents I am interested in is actually smaller. (I am also filtering based on another field.)
See below result for the search call written above in the earlier posts:

{
  "took" : 431,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "All_IDs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 9846594,
      "buckets" : [ ]
    }
  }
}

But I understand what you are saying, the "relation" : "gte"is clearly not ok for me, and the "sum_other_doc_count" : 9846594 is also way too high. I will review the transforms api as you suggested, and I hope I can figure something out. Any other tips which can help me on the way are very much appreciated. Thank you for your help!

Mark_Harwood · August 11, 2021, 9:18am

What you may find is that by transforming low-level log records into entities you can do a lot more interesting analysis e.g. average length of time taken for different phases, max number of phase changes per entity etc.
Derived entity attributes like "duration" or "lastKnownState" are typically useful things the transform API can help with. It shifts the computational cost and complexity of doing this sort of analysis from query-time to index-time.

system · September 8, 2021, 9:19am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elastic query with filters Elasticsearch	3	499	October 29, 2018
Highlight events with only one value for a field Elasticsearch	14	522	June 18, 2019
Find missing documents in an index Elasticsearch	1	781	July 6, 2017
How to get aggregations with buckets of zero count only for the filter provided Elasticsearch	1	2394	July 24, 2020
Filter based on the doc_count with aggregations Elasticsearch	2	16803	July 5, 2017

Match entities where field with specific value does not exists

Related topics