How to do a NOT IN like query in Elastic?

Hello

Is NOT-IN like query doable in Elastic?

Example, index has the following fields & data

_id, job_id, transaction, data_source
1, 123, read, RDBMS
2, 123, read, File
3, 123, write, File
4, 124, read, File
5, 124, export, RDBMS
6, 125, read, RDBMS
7, 126, export, RDBMS
8, 127, write, File

How to query records for jobs that do not include File read

SELECT * FROM ... WHERE job_id NOT IN (SELECT job_id FROM ... WHERE transaction='read' AND data_source='File')

Expected results -

_id, job_id, transaction, data_source
6, 125, read, RDBMS
7, 126, export, RDBMS
8, 127, write, File

Thanks

It's not possible to perform this kind of set-based operation.

If the SELECT statement inside of NOT IN(...) targets the same index, then this could be achieved with a bool query with two must_not clauses e.g.

{
  "query": {
    "bool": {
      "must_not": [
      {
        "term": {
          "transaction": {
            "value": "read"
          }
        }
      },
      {
        "term": {
          "data_source": {
            "value": "File"
          }
        }
      }]
    }
  }
}

which would be the equivalent of SQL

SELECT * 
FROM 
    jobs 
WHERE 
    transaction <> 'read' 
AND 
    data_source <> 'File'

Thank @forloop for the response

The SELECT statement inside of NOT IN targets the same index.

But the

SELECT *
FROM jobs
WHERE transaction <> 'read' AND data_source <> 'File'

would yield

_id, job_id, transaction, data_source
1, 123, read, RDBMS
3, 123, write, File
5, 124, export, RDBMS
6, 125, read, RDBMS
7, 126, export, RDBMS
8, 127, write, File

different from the desired

SELECT * FROM jobs WHERE job_id NOT IN (SELECT job_id FROM jobs WHERE transaction='read' AND data_source='File')

_id, job_id, transaction, data_source
6, 125, read, RDBMS
7, 126, export, RDBMS
8, 127, write, File

Ah I see, I missed the piece where the clauses are applied across job_id.

I don't believe there's an equivalent way to do this. Two ways that I would propose:

  1. Denormalizing the data to nest transaction and data_source tuples as a collection of nested objects mapped as nested types. Then use a nested query with similar logic as before to exclude documents.

or

  1. Perform two queries. First to retrieve the job_id that match transaction = read and data_source = File . Second, to query with the job_id applied in a terms query inside of a bool query must_not clause. Depending on the number of ids that we're dealing with, the queries may need to be further partitioned.

Both approaches have their tradeoffs.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.