I'm going to suggest you denormalize your documents, and allow the system to assign document IDs. You can retain your mapping, but the group_id
field will become multivalued, so you will have a single document per user. (If you still strongly prefer to use an assigned document ID, just use the user_id
.
By doing this, meeting the search requirements you present is relatively simple.
Here I create the index:
PUT intersection
{
"settings": {
"number_of_replicas": 0,
"number_of_shards": 1
},
"mappings": {
"properties": {
"group_id": {
"type": "long"
},
"user_id": {
"type": "long"
}
}
}
}
Then I will index a few user documents
POST intersection/_doc
{"user_id":1,"group_id":[1,2,3,4,5]}
POST intersection/_doc
{"user_id":2,"group_id":[1,3,5]}
POST intersection/_doc
{"user_id":3,"group_id":[1,4,5]}
POST intersection/_doc
{"user_id":4,"group_id":[2,5]}
POST intersection/_doc
{"user_id":5,"group_id":[2,5]}
POST intersection/_doc
{"user_id":6,"group_id":[1,2,6,7,8,9]}
From this group of users, if we are told "find all the users who are exclusively in group 1 or group 1", we expect user_ids 2, 3, 4, and 5.
Notice in particular I put user 6 into groups 6, 7, 8, and 9 all alone. We will use this later to help confirm we are querying properly.
Let me show you just the query snippet to accomplish that (I hope it's clear in isolation):
"query": {
"bool": {
"should": [
{
"bool": {
"must": [{"match": {"group_id": 1}}],
"must_not": [{"match": {"group_id": 2}}]
}
},
{
"bool": {
"must": [{"match": {"group_id": 2}}],
"must_not": [{"match": {"group_id": 1}}]
}
}
]
}
},
Now, given that list of user documents, we want to get a count of how many of them there are in all other groups. For this, I'm going to apply a simple terms aggregation on the group_id
field.
"aggs": {
"unique": {
"terms": {
"field": "group_id",
"size": 10
}
}
}
Put those together:
POST intersection/_search
{
"size": 0,
"query": {
"bool": {
"should": [
{
"bool": {
"must": [{"match": {"group_id": 1}}],
"must_not": [{"match": {"group_id": 2}}]
}
},
{
"bool": {
"must": [{"match": {"group_id": 2}}],
"must_not": [{"match": {"group_id": 1}}]
}
}
]
}
},
"aggs": {
"unique": {
"terms": {
"field": "group_id",
"size": 10
}
}
}
}
The result is then:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"unique" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 5,
"doc_count" : 4
},
{
"key" : 1,
"doc_count" : 2
},
{
"key" : 2,
"doc_count" : 2
},
{
"key" : 3,
"doc_count" : 1
},
{
"key" : 4,
"doc_count" : 1
}
]
}
}
}
Notice the hits
list is empty, that's because I asked for size: 0
, as I'm not really interested in the whole documents. I do see that hits.total.value
is 4, which we expect: the group-1 XOR group-2 should match exactly 4 users.
Notice also that the aggregation doesn't have any buckets for 6, 7, 8, or 9, so we know for sure user 6 didn't get selected.