Elasticsearch query for `SELECT id FROM foo WHERE id NOT IN (SELECT id FROM foo WHERE ...)

I want to do a "not in present index" type of operation. For example, let's say I have an index called customer_subscription with just these 4 records:

| customer_id | pay_date   |
+-------------+------------+
|           1 | 2022-01-01 |
|           2 | 2022-01-01 |
|           1 | 2022-02-01 |
|           3 | 2022-02-01 |
+-------------+------------+

With SQL, I can run a command like this to see how many new customer_id were created after the month of January 2022:

SELECT * FROM customer_subscription
WHERE customer_id NOT IN (
  SELECT customer_id
  FROM customer_subscription
  WHERE pay_date >= '2022-01-01'
  AND pay_date < '2022-02-01'
)
AND pay_date = '2022-02-01';

This will return one result: customer_id:3, pay_date:'2022-02-01'.

It's not clear to me how to create an elasticsearch query that can use a not-in like operator on the customer_subscription index.

Can someone point me in the right direction or link the relevant documentation for me to study?


Note, I'm trying to learn pipeline aggregation because this might help me? Still trying to understand the details.

I don't think this query possible in a native Elasticsearch query, however, you might want to take a look at EQL, this might be able to achieve what you're looking to do. (I'm unfortunately not that familiar with EQL, so I won't be much help trying to use it, but its features seem like they might work here)

It's not pretty but I'm wondering if you couldn't use a bucket script aggregation if you convert the timestamps to epoch and use them as numbers in your calculation; maybe combined with a filter.

Bucket script aggregation | Elasticsearch Guide [8.6] | Elastic but this is definitely one of the more complex aggs to write :sweat_smile:

Thanks @xeraa

Do you think you can give a very high level pseudo code type of example? FOr example, I started with something like this:

// Insert 4 records for test purposes
POST customer_subscription/_bulk
{"index":{}}
{"customer_id":1,"pay_date":"2022-01-01"}
{"index":{}}
{"customer_id":2,"pay_date":"2022-01-01"}
{"index":{}}
{"customer_id":1,"pay_date":"2022-02-01"}
{"index":{}}
{"customer_id":3,"pay_date":"2022-02-01"}

// Attempt to use bucket_script
GET customer_subscription/_search
{
  "size": 0,
  "aggs": {
    "past_month": {
      "range": {
        "field": "pay_date",
        "ranges": [
          {
            "from": "2022-01-01",
            "to": "2022-02-01"
          }
        ]
      }
    },
    "current_month": {
      "range": {
        "field": "pay_date",
        "ranges": [
          {
            "from": "2022-02-01"
          }
        ]
      }
    },
    "new_customer_subscribers": {
      "bucket_script": {
        "buckets_path": {
          "past_month_bucket": "past_month",
          "current_month_bucket": "current_month"
        },
        // return list of customers that did not exist last month
        "script": "current_month_bucket.map(customer=>!past_month_bucket.includes(customer))"
      }
    }
  }
}

But my search query gave this error:

{
  "error": {
    "root_cause": [
      {
        "type": "action_request_validation_exception",
        "reason": "Validation Failed: 1: bucket_script aggregation [new_customer_subscribers] must be declared inside of another aggregation;"
      }
    ],
    "type": "action_request_validation_exception",
    "reason": "Validation Failed: 1: bucket_script aggregation [new_customer_subscribers] must be declared inside of another aggregation;"
  },
  "status": 400
}

If I attempt to put new_customer_subscriber inside another aggregation, then it seems new_customer_subscriber can't access the buckets for current_month and past_month.

I'm still learning about the various forms of aggregations, so I may have approached my problem with a completely incorrect mindset.

Here is a possible solution using a bucket selector. The query first aggregates by customer_id, then finds the earliest pay_date, then uses a bucket selector to find those from the selected month. It is probably possible to enhance the script to convert a date into an epoch, rather than specifying the epoch value.

GET /customer_subscription/_search
{
  "size": 0,
  "aggs": {
    "customers": {
      "terms": {
        "field": "customer_id",
        "size": 1000
      },
      "aggs": {
        "months_min": {
          "min": {
            "field": "pay_date"
          }
        },
        "new_customer_filter": {
          "bucket_selector": {
            "buckets_path": {
              "min_pay_date": "months_min"
            },
            "script": "params.min_pay_date >= 1643673600000L"
          }
        }        
      }
    }
  }
}
1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.