Simulate NOT IN select

Hi,
i need help how to create one search query in Elasticsearch.
Image we have index (login of clients) with 2 columns -> timestamp (when action happens) and user id.

What i would like to find what users have login after some date (for example this year) but not before this date (not previous years) .

In sql query would look like this:
SELECT userId
FROM A
WHERE A.timestamp > [DATE]
AND NOT IN (SELECT * FROM A WHERE A.timestamp < [DATE])

... and perfect would be, if returned 'userId' would be unique :slight_smile:

thanks a lot

That requires a join between documents, which Elasticsearch does not support, so you may need to do it in two queries.

Hi,
thanks for your response. Can you please help me build this two queries ? (or navigate me with some hints)

Thank you very much !

Hi,
i have tried many queries (with compound queries https://www.elastic.co/guide/en/elasticsearch/reference/current/compound-queries.html) but i have failed so far.

How i can search for result with two queries ?

Thanks a lot

I was initially thinking about first getting all users since the cutoff date in one query and then lookup which of these that have been seen previously in another query. If the cutoff timestamp is not recent and the first query results in a lot of users, this will however be very expensive and inefficient.

The best way to get the new users since a specific date might be to create an entity-centric index. If you had a separate index with one record per user (user id as key), which contain the timestamp of the users first activity, you could easily use this to get new users after a specific timestamp..

Hi,
entity-centric index is really good idea. But i would like also try option with "nested query".

I was try to create such query (first query to get users since cutoff date and lookup for users which have NOT been seen previously) -> but without no success. Can you give me any hint ?

thanks a lot

Hi,
so i have prepared real example :slight_smile:

image following data :
CREATED | USER_ID
2017-01-01T15:40:00.000Z | 1
2017-01-01T15:40:00.000Z | 2
2017-02-01T15:40:00.000Z | 1
2017-02-01T15:40:00.000Z | 3

when i execute simple _search query (math all) -> you can check data here (https://pastebin.com/RgYfQq0d)

now when i execute following query :

  "query": {
    "bool": {
      "must": {
        "range": { "CREATED": { "gte": "2017-01-31T12:00:00.000Z"} }
      },
      "filter": {
        "bool": {
          "must_not": {
            "range": { "CREATED": { "lte": "2017-01-31T12:00:00.000Z"} }
          }  
        }
      }
    }
  },
  "_source": {
		"includes": [
			"CREATED",
			"USER_ID"
		],
		"excludes": []
	}

I would expect to return single instance USER_ID:3 ... but i have 3 and also 1

What am i doing wrong ?

thanks a lot

How much data and users will you have and need to query across? The approach with two queries will probably only work for reasonably small datasets.

Hi,
query should be executed (fortunately) over "subset" of all our data (in special index).

Every day there is increment of ~500 entries (and i have data for 3 years) -> so till today it is 600 000 entries.
Count of different users is less than 100 000.

Thanks

How far back can the cut-off date go?

Let say i will execute query at the beginning of each month -> cut-off date will be previous month.

Hi,

so i have tried to do reindex and create new index where it will be only first login (_id will be USER_ID).

I have executed:

    POST _reindex
{
  "source": {
    "index": "mas-test-act-*"
  },
  "dest": {
    "index": "mas-act"
  },
  
  "script": {
    "inline": "ctx._id = ctx._source['USER_ID'];",
    "lang": "painless"
  }
}

Problem: for each user there is LAST login. Can i do reindex without updating existing documents ?

Thanks a lot

I think you may need to do this through a scripted update as you want to set fields based on the content of the current document in the index, not just the one you are inserting. The Reindex API does however as far as I know not support scripted updates, so it may need to be done outside of Elasticsearch as outlined in this post on entity-centric indices.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.