Intersection between documents by specific property

I've got index with the following mapping:

"properties": {
    "group_id": {
        "type": "long"
    },
    "user_id": {
        "type": "long"
    }
 }

Document id has the following structure {group_id}-{user_id}.

Currently I've got 26 million documents in this index. One group can have more than 150 000 users.

I've got the following requirement:

We've got array of group_ids as an input. I have to find unique users, associated with the given groups and search them in the other groups.

Example:

Input: group_ids - [1, 2]

Output:

Let say groups 1 and 2 have 100 000 distinct users and 75 000 of them are in group 3, 50 000 of them are in group 4 and 25 000 in group 5, I have to return the following:

group 3: 75 000 users (75%)
group 4: 50 000 users (50%)
group 5: 25 000 users (25%)

I've wrote a query to select all distinct user_ids, associated to the groups from the input. Then I create another query and use them as terms filter to get array of groups, related to the same users.

But this query doesn't work properly because of the following limitation:

By default, Elasticsearch limits the terms query to a maximum of 65,536 terms. You can change this limit using the index.max_terms_count setting.

And even if I change max_terms_count the query is really slow, so it's not an option.

I guess there is a math algorithm to calculate this users intersection between groups but I cannot find it.

Please give an advice.
Thank you in advance!

I'm going to suggest you denormalize your documents, and allow the system to assign document IDs. You can retain your mapping, but the group_id field will become multivalued, so you will have a single document per user. (If you still strongly prefer to use an assigned document ID, just use the user_id.

By doing this, meeting the search requirements you present is relatively simple.

Here I create the index:

PUT intersection
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "group_id": {
        "type": "long"
      },
      "user_id": {
        "type": "long"
      }
    }
  }
}

Then I will index a few user documents

POST intersection/_doc
{"user_id":1,"group_id":[1,2,3,4,5]}

POST intersection/_doc
{"user_id":2,"group_id":[1,3,5]}

POST intersection/_doc
{"user_id":3,"group_id":[1,4,5]}

POST intersection/_doc
{"user_id":4,"group_id":[2,5]}

POST intersection/_doc
{"user_id":5,"group_id":[2,5]}

POST intersection/_doc
{"user_id":6,"group_id":[1,2,6,7,8,9]}

From this group of users, if we are told "find all the users who are exclusively in group 1 or group 1", we expect user_ids 2, 3, 4, and 5.

Notice in particular I put user 6 into groups 6, 7, 8, and 9 all alone. We will use this later to help confirm we are querying properly.

Let me show you just the query snippet to accomplish that (I hope it's clear in isolation):

  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [{"match": {"group_id": 1}}],
            "must_not": [{"match": {"group_id": 2}}]
          }
        },
        {
          "bool": {
            "must": [{"match": {"group_id": 2}}],
            "must_not": [{"match": {"group_id": 1}}]
          }
        }
      ]
    }
  },

Now, given that list of user documents, we want to get a count of how many of them there are in all other groups. For this, I'm going to apply a simple terms aggregation on the group_id field.

  "aggs": {
    "unique": {
      "terms": {
        "field": "group_id",
        "size": 10
      }
    }
  }

Put those together:

POST intersection/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [{"match": {"group_id": 1}}],
            "must_not": [{"match": {"group_id": 2}}]
          }
        },
        {
          "bool": {
            "must": [{"match": {"group_id": 2}}],
            "must_not": [{"match": {"group_id": 1}}]
          }
        }
      ]
    }
  },
  "aggs": {
    "unique": {
      "terms": {
        "field": "group_id",
        "size": 10
      }
    }
  }
}

The result is then:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "unique" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 5,
          "doc_count" : 4
        },
        {
          "key" : 1,
          "doc_count" : 2
        },
        {
          "key" : 2,
          "doc_count" : 2
        },
        {
          "key" : 3,
          "doc_count" : 1
        },
        {
          "key" : 4,
          "doc_count" : 1
        }
      ]
    }
  }
}

Notice the hits list is empty, that's because I asked for size: 0, as I'm not really interested in the whole documents. I do see that hits.total.value is 4, which we expect: the group-1 XOR group-2 should match exactly 4 users.

Notice also that the aggregation doesn't have any buckets for 6, 7, 8, or 9, so we know for sure user 6 didn't get selected.

1 Like

Thank you very much for the suggestion. I think this is the best solution. Your help is highly appreciated.

1 Like

That's great!

The solution does work; I suspect that in your system, assignments of groups to users will change over time, and that you will have to code a way to add and remove group_id values to/from user documents.

One way to accomplish this is to retrieve the existing user doc, update the group_id array in the client, and reindex the document, taking care to overwrite or delete the previous version. (Operationally, this is simpler if you have used the user_id as the document id.)

However, I want to show you how to accomplish this without actually retrieving the document in your client code, using either the Update API, if or Update By Query API.

In either case, adding/removing a group_id value to/from the document will be accomplished by a Painless script, and I'll show you tested examples.

First, the script for adding a group_id.
I'm guarding against a doc that stored group_id as an int instead of an array, as well as inserting the same value more than once.
I'm not guarding against other things, like something other than an int having been stored.

  "script": {
    "source": """
      // guarantee _source.group_id is a list
      if (!(ctx._source.group_id instanceof List)) {
        ctx._source.group_id = [ctx._source.group_id]
      }
      // guard against adding the same group more than once
      if (!(ctx._source.group_id.contains(params.value))) {
        // params.value is the group_id to add
        ctx._source.group_id.add(params.value)
      }
    """,
    "lang": "painless",
    "params": {
      "value": 9
    }

Now, the script for removing a group_id. It guards against a non-array source and against trying to remove a group_id that already isn't there.

  "script": {
    "source": """
      // guarantee _source.group_id is a list
      if (!(ctx._source.group_id instanceof List)) {
        ctx._source.group_id = [!(ctx._source.group_id]
      }
      // guard against group_id already not being in user
      if (ctx._source.group_id.contains(params.value)) {
        ctx._source.group_id.remove(
          // find the position in the array, remove it
          ctx._source.group_id.indexOf(params.value)
        )
      }
    """,
    "lang": "painless",
    "params": {
      "value": 9
    }

If you are using your user_id as the document id, you can use the Update API.
(Note that this is still a POST operation, not a PUT operation, as there is server-side processing and not just direct entity replacement.)

POST intersection/_update/99
{
  "script": {
    *** see the examples above ***
  }
}

If you are allowing Elasticsearch to assign document ids, you will use the Update By Query API.

POST intersection/_update_by_query
{
  "script": {
    *** see the examples above ***
  },
  "query": {
    "term": {
      "user_id": {
        "value": 99
      }
    }
  }
}
1 Like

Thank you for the suggestion. It's really helpful.