Subset match query

Hi everyone,

Thanks for reading.

Here's my problem. I have the following documents:

"course #1": {
    "requirements": [
        {
            "name": "AWS EC2 certification",
            "type": "certification"
        },
        {
            "name": "Python Advanced",
            "type": "certification"
        },
        {
            "name": "Critical Thinking",
            "type": "skill",
        }
}

"course #2": {
    "requirements": [
        {
            "name": "Ruby on Rails advanced",
            "type": "certification"
        },
        {
            "name": "Critical Thinking",
            "type": "skill",
        }
}

I want the users to find the courses where they match all the requirements, for example:

The user that has "Critical Thinking", "Ruby on Rails advanced" and "Python Advanced", should get only the course #2. In other words, I want to find the courses where the requirements are a subset of the user skills.

I saw posts related to this issue but are very old and I'm hoping to find a clean solution.

Thanks. I hope you have some inputs.

Start simple with terms and terms_set query. After that you can add a nested layer on top.

Example, let's map requirements as simple keyword field:

PUT test_index
{
  "mappings": {
    "properties": {
      "requirements": {
        "type": "keyword"
      }
    }
  }
}

Indexing some docs:

POST test_index/_bulk
{ "index" : { "_id" : "course1" } }
{"requirements": ["A", "B", "C"]}
{ "index" : { "_id" : "course2" } }
{"requirements": ["D", "C"]}

The query below returns you only courses that satisfy all your requirements.

GET test_index/_search
{
 "query": {
   "terms_set" : {
     "requirements" : {
       "terms" : ["C", "D"],
       "minimum_should_match_script": {
          "source": "params.num_terms"
        }
     }
   }
 }
}

Hi Mayya, thanks for the quick response.
The solution you posted is working fine, but when I tried to add a nested layer on top, it didn't work. So I was wondering if the terms_set works with nested objects, I looked this post but it has no answers Terms_set within a nested query, and what is happening to me is exactly that. When I used a nested query, there are no results in the response.

It looks like this:

{
        nested: {
          path: 'requirements',
          query: {
            terms_set: {
              'requirements.name': {
                terms: ["Python Advanced", "Critical Thinking"],
                minimum_should_match_script: {
                  source: "params.num_terms"
                },
              },
            }
          }
        }
}

And it works when I copy/paste the one you posted

I've realized that I have provided a wrong solution with terms_set query, as it would not returned subset, but rather a superset; that is you would not be able to find the courses where the requirements are a subset of the users skills, but rather find the courses where the user skills are the subset of course requirements.

Again, disregarding a nested part, to find a correct subset, you would need to writer a custom script that would compare if each course requirement is within user skills, something like this:

GET test_index/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": """
            for (int i = 0; i < doc['requirements'].length; i++) {
              boolean found = false;
              for(int j = 0; j < params.user_skills.length; j++) { 
                if (doc['requirements'][i] == params.user_skills[j]) {
                  found = true;
                  break;
                }
              }  
              if (found == false) return false;
            }
            return true;
            """,
            "params": {
              "user_skills": ["C", "D"]
            }
          }
        }
      }
    }
  }
}

You can't use this for nested objects, as each nested object is a separate document. Some possible solutions I can see are:

  1. Don't use nested, and just use "requirements" as a keyword field with multiple values (the way I did in my example)
  2. You can copy nested values in a root/parent object with include_in_parent and include_in_root options, and run queries on a corresponding root field.

May be there are some other clever ways to do it for nested objects, but nothing comes to my mind right now