Simulate NOT IN select


(Martin Duris) #1

Hi,
i need help how to create one search query in Elasticsearch.
Image we have index (login of clients) with 2 columns -> timestamp (when action happens) and user id.

What i would like to find what users have login after some date (for example this year) but not before this date (not previous years) .

In sql query would look like this:
SELECT userId
FROM A
WHERE A.timestamp > [DATE]
AND NOT IN (SELECT * FROM A WHERE A.timestamp < [DATE])

... and perfect would be, if returned 'userId' would be unique :slight_smile:

thanks a lot


(Christian Dahlqvist) #2

That requires a join between documents, which Elasticsearch does not support, so you may need to do it in two queries.


(Martin Duris) #3

Hi,
thanks for your response. Can you please help me build this two queries ? (or navigate me with some hints)

Thank you very much !


(Martin Duris) #4

Hi,
i have tried many queries (with compound queries https://www.elastic.co/guide/en/elasticsearch/reference/current/compound-queries.html) but i have failed so far.

How i can search for result with two queries ?

Thanks a lot


(Christian Dahlqvist) #5

I was initially thinking about first getting all users since the cutoff date in one query and then lookup which of these that have been seen previously in another query. If the cutoff timestamp is not recent and the first query results in a lot of users, this will however be very expensive and inefficient.

The best way to get the new users since a specific date might be to create an entity-centric index. If you had a separate index with one record per user (user id as key), which contain the timestamp of the users first activity, you could easily use this to get new users after a specific timestamp..


(Martin Duris) #6

Hi,
entity-centric index is really good idea. But i would like also try option with "nested query".

I was try to create such query (first query to get users since cutoff date and lookup for users which have NOT been seen previously) -> but without no success. Can you give me any hint ?

thanks a lot


(Martin Duris) #7

Hi,
so i have prepared real example :slight_smile:

image following data :
CREATED | USER_ID
2017-01-01T15:40:00.000Z | 1
2017-01-01T15:40:00.000Z | 2
2017-02-01T15:40:00.000Z | 1
2017-02-01T15:40:00.000Z | 3

when i execute simple _search query (math all) -> you can check data here (https://pastebin.com/RgYfQq0d)

now when i execute following query :

  "query": {
    "bool": {
      "must": {
        "range": { "CREATED": { "gte": "2017-01-31T12:00:00.000Z"} }
      },
      "filter": {
        "bool": {
          "must_not": {
            "range": { "CREATED": { "lte": "2017-01-31T12:00:00.000Z"} }
          }  
        }
      }
    }
  },
  "_source": {
		"includes": [
			"CREATED",
			"USER_ID"
		],
		"excludes": []
	}

I would expect to return single instance USER_ID:3 ... but i have 3 and also 1

What am i doing wrong ?

thanks a lot


(Christian Dahlqvist) #8

How much data and users will you have and need to query across? The approach with two queries will probably only work for reasonably small datasets.


(Martin Duris) #9

Hi,
query should be executed (fortunately) over "subset" of all our data (in special index).

Every day there is increment of ~500 entries (and i have data for 3 years) -> so till today it is 600 000 entries.
Count of different users is less than 100 000.

Thanks


(Christian Dahlqvist) #10

How far back can the cut-off date go?


(Martin Duris) #11

Let say i will execute query at the beginning of each month -> cut-off date will be previous month.


(Martin Duris) #12

Hi,

so i have tried to do reindex and create new index where it will be only first login (_id will be USER_ID).

I have executed:

    POST _reindex
{
  "source": {
    "index": "mas-test-act-*"
  },
  "dest": {
    "index": "mas-act"
  },
  
  "script": {
    "inline": "ctx._id = ctx._source['USER_ID'];",
    "lang": "painless"
  }
}

Problem: for each user there is LAST login. Can i do reindex without updating existing documents ?

Thanks a lot


(Christian Dahlqvist) #13

I think you may need to do this through a scripted update as you want to set fields based on the content of the current document in the index, not just the one you are inserting. The Reindex API does however as far as I know not support scripted updates, so it may need to be done outside of Elasticsearch as outlined in this post on entity-centric indices.


(system) #14

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.