Elasticsearch query for `SELECT id FROM foo WHERE id NOT IN (SELECT id FROM foo WHERE ...)

learningelastic · February 10, 2023, 10:39pm

I want to do a "not in present index" type of operation. For example, let's say I have an index called customer_subscription with just these 4 records:

| customer_id | pay_date   |
+-------------+------------+
|           1 | 2022-01-01 |
|           2 | 2022-01-01 |
|           1 | 2022-02-01 |
|           3 | 2022-02-01 |
+-------------+------------+

With SQL, I can run a command like this to see how many new customer_id were created after the month of January 2022:

SELECT * FROM customer_subscription
WHERE customer_id NOT IN (
  SELECT customer_id
  FROM customer_subscription
  WHERE pay_date >= '2022-01-01'
  AND pay_date < '2022-02-01'
)
AND pay_date = '2022-02-01';

This will return one result: customer_id:3, pay_date:'2022-02-01'.

It's not clear to me how to create an elasticsearch query that can use a not-in like operator on the customer_subscription index.

Can someone point me in the right direction or link the relevant documentation for me to study?

Note, I'm trying to learn pipeline aggregation because this might help me? Still trying to understand the details.

BenB196 · February 11, 2023, 2:16pm

I don't think this query possible in a native Elasticsearch query, however, you might want to take a look at EQL, this might be able to achieve what you're looking to do. (I'm unfortunately not that familiar with EQL, so I won't be much help trying to use it, but its features seem like they might work here)

xeraa · February 13, 2023, 4:12pm

It's not pretty but I'm wondering if you couldn't use a bucket script aggregation if you convert the timestamps to epoch and use them as numbers in your calculation; maybe combined with a filter.

Bucket script aggregation | Elasticsearch Guide [8.6] | Elastic but this is definitely one of the more complex aggs to write

learningelastic · February 13, 2023, 7:05pm

Thanks @xeraa

Do you think you can give a very high level pseudo code type of example? FOr example, I started with something like this:

// Insert 4 records for test purposes
POST customer_subscription/_bulk
{"index":{}}
{"customer_id":1,"pay_date":"2022-01-01"}
{"index":{}}
{"customer_id":2,"pay_date":"2022-01-01"}
{"index":{}}
{"customer_id":1,"pay_date":"2022-02-01"}
{"index":{}}
{"customer_id":3,"pay_date":"2022-02-01"}

// Attempt to use bucket_script
GET customer_subscription/_search
{
  "size": 0,
  "aggs": {
    "past_month": {
      "range": {
        "field": "pay_date",
        "ranges": [
          {
            "from": "2022-01-01",
            "to": "2022-02-01"
          }
        ]
      }
    },
    "current_month": {
      "range": {
        "field": "pay_date",
        "ranges": [
          {
            "from": "2022-02-01"
          }
        ]
      }
    },
    "new_customer_subscribers": {
      "bucket_script": {
        "buckets_path": {
          "past_month_bucket": "past_month",
          "current_month_bucket": "current_month"
        },
        // return list of customers that did not exist last month
        "script": "current_month_bucket.map(customer=>!past_month_bucket.includes(customer))"
      }
    }
  }
}

But my search query gave this error:

{
  "error": {
    "root_cause": [
      {
        "type": "action_request_validation_exception",
        "reason": "Validation Failed: 1: bucket_script aggregation [new_customer_subscribers] must be declared inside of another aggregation;"
      }
    ],
    "type": "action_request_validation_exception",
    "reason": "Validation Failed: 1: bucket_script aggregation [new_customer_subscribers] must be declared inside of another aggregation;"
  },
  "status": 400
}

If I attempt to put new_customer_subscriber inside another aggregation, then it seems new_customer_subscriber can't access the buckets for current_month and past_month.

I'm still learning about the various forms of aggregations, so I may have approached my problem with a completely incorrect mindset.

markbyrnegkc · February 14, 2023, 7:51pm

Here is a possible solution using a bucket selector. The query first aggregates by customer_id, then finds the earliest pay_date, then uses a bucket selector to find those from the selected month. It is probably possible to enhance the script to convert a date into an epoch, rather than specifying the epoch value.

GET /customer_subscription/_search
{
  "size": 0,
  "aggs": {
    "customers": {
      "terms": {
        "field": "customer_id",
        "size": 1000
      },
      "aggs": {
        "months_min": {
          "min": {
            "field": "pay_date"
          }
        },
        "new_customer_filter": {
          "bucket_selector": {
            "buckets_path": {
              "min_pay_date": "months_min"
            },
            "script": "params.min_pay_date >= 1643673600000L"
          }
        }        
      }
    }
  }
}

system · March 14, 2023, 7:52pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
SQL Queries Elasticsearch	2	352	October 3, 2019
Simulate NOT IN select Elasticsearch	13	2796	July 2, 2017
Is there any solution to do the “NOT IN” functionality in Elasticsearch? Elasticsearch	4	455	July 6, 2017
SQL Query with "NOT IN" keyword replacement in ES Elasticsearch	6	641	January 28, 2019
How to do a NOT IN like query in Elastic? Elasticsearch	4	13944	September 5, 2018

Elasticsearch query for `SELECT id FROM foo WHERE id NOT IN (SELECT id FROM foo WHERE ...)

Related topics