How to find docs that contain the exact specified terms

Hi, I'm trying to get docs that contain exactly the specified terms and nothing else. The terms query doesn't fit, because it also returns docs that contain additional terms.

Create the index:

PUT my-index
{
   "mappings": {
      "properties": {
        "my-field" : {
           "type" : "keyword"
        }
      }
   }
}

Index a document:

PUT my-index/_doc/1
{
   "my-field": ["A", "B"]
}

Index another document:

PUT my-index/_doc/2
{
   "my-field": ["A", "B", "C"]
}

Use the terms query to find documents containing A and B:

GET my-index/_search?pretty
{
   "query":
   {
      "terms": {
         "my-field": ["A", "B"]
      }
   }
}

Results:

{
  "took" : 178,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my-index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "my-field" : [
            "A",
            "B"
          ]
        }
      },
      {
        "_index" : "my-index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "my-field" : [
            "A",
            "B",
            "C"
          ]
        }
      }
    ]
  }
}

Elasticsearch returns both document 1 and document 2 because both contain A and B in my-field. That is the expected behavior, but I'd like to return only documents containing strictly A and B (if a doc contains A, B, and C it should be excluded from the results).

I've read the documentation (filters, aggregations, ...) but I can't find a way.

Thanks in advance for your help :wink:

Hey @Cholley_Alexandre ,

Why not tell Elasticsearch not to return what you don't want? ^^

I added another document to make sure it doesn't appear in the search result:

PUT my-index/_doc/3
{
   "my-field": ["A", "C"]
}

Then tell Elasticsearch that matching documents must not contain C, like this:

GET my-index/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "terms": {
            "my-field": [
              "C"
            ]
          }
        }
      ]
    }
  }
}
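
On its own this only excludes C, of course; with real data you would combine it with filter clauses so that documents are still required to contain A and B. Something like this (a sketch, untested against your data):

GET my-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "my-field": "A" } },
        { "term": { "my-field": "B" } }
      ],
      "must_not": [
        { "terms": { "my-field": ["C"] } }
      ]
    }
  }
}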

More about Compound queries.

Hope it helps!


Hi marone,

Thank you for your answer. Yes, it's a good idea and it works fine :+1:.

But in my use case (a search tool for agricultural regulatory data), users need to combine between 1 and 4 values (chemical substances) and exclude all the others (about 500 possible values). So I'd need to know all the values to exclude before composing the request. I could compose a first query (a filtered aggregation, see: Terms aggregation | Elasticsearch Guide [7.16] | Elastic) to retrieve all the values to exclude, and then a second query for the search? But I don't know if that's good for performance... maybe with the aggregation caches... (see: Aggregations | Elasticsearch Guide [7.16] | Elastic)

To find docs that contain exactly the combined terms "A" and "B", I first retrieve all values to exclude with a filtered aggregation like this:

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "my-field",
        "exclude": ["A","B"]
      }
    }
  }
}
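
With the toy index from above this gives roughly the following (abbreviated):

{
  ...
  "aggregations": {
    "my-agg-name": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "C", "doc_count": 2 }
      ]
    }
  }
}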

The result is a bucket with all possible values (e.g. "C", "D", "E", ...) excluding my terms "A" and "B". Programmatically I put these values in an array,
and in a second step I compose a new query with "must_not" and my array of values to exclude:

GET my-index/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "terms": {
            "my-field": ["C", "D", "E", ...] 
          }
        }
      ]
    }
  }
}

Nice :slight_smile:

Well, I think you won't have performance issues with the exclude parameter as long as you don't use wildcards.

The performance of the regexp query can vary based on the regular expression provided. To improve performance, avoid using wildcard patterns, such as .* or .*?+ , without a prefix or suffix.

and

Avoid beginning patterns with * or ? . This can increase the iterations needed to find matching terms and slow search performance.

See Regexp and Wildcards docs.
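
To illustrate (a sketch of my own, not from the docs): a pattern with a literal prefix like A.* stays cheap, while an unanchored .*A.* forces Elasticsearch to walk far more terms:

GET my-index/_search
{
  "query": {
    "regexp": {
      "my-field": {
        "value": "A.*"
      }
    }
  }
}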

Did I answer your question?

Another idea from me is to make something like a fingerprint field. You can implement it with a runtime field or an ingest pipeline. The former is easier and fits when you can reduce the documents to a reasonable number by filtering those containing both A and B. You may deduplicate and/or sort the values and concatenate them to make the fingerprint.

Great :wink: thank you marone.

Hi Tomo_M, that's a good way too. I think you mean I need to create an additional field in my doc that contains a fingerprint of the concatenated terms (e.g. an MD5 hash), and then query on this fingerprint.

Create the index:

PUT my-index
{
   "mappings": {
      "properties": {
        "my-field" : {
           "type" : "keyword"
        },
        "fingerprint-of-axact-terms": {
          "type": "keyword"
        }
      }
   }
}

Index a document:

PUT my-index/_doc/1
{
   "my-field": ["A", "B"],
   "fingerprint-of-exact-terms": "5ae395e8ab6a4121fc3445afdde6b13f"
}

When a user queries on the exact terms A and B, I create a hash of these terms and query on the fingerprint field:

GET my-index/_search?pretty
{
   "query":
   {
      "terms": {
         "fingerprint-of-exact-terms": "5ae395e8ab6a4121fc3445afdde6b13f"
      }
   }
}

But the problem is that the order of the terms can give a different fingerprint (e.g. A and B = 5ae395e8ab6a4121fc3445afdde6b13f / B and A = 751bf9b43937fba8970400edf1e28a5d). So you must put the terms in the same order before indexing and when querying.

Exactly. It depends on the requirements.

If the order has meaning, different fingerprints for different orders make sense. If not, order should not affect the fingerprint, and you have to sort the values before making the fingerprint.

In addition, in this case, just using the concatenated values (e.g. 'A_B') as the fingerprint could be acceptable to distinguish identical sets.

This is my sample code. Using runtime mappings, you don't need to calculate the fingerprint at index time yourself.

PUT /test_runtime/
{
  "mappings": {
    "properties": {
      "my-field":{
        "type": "keyword"
      }
    }
  }
}

POST /test_runtime/_bulk
{"index":{}}
{"my-field":["A", "B", "C"]}
{"index":{}}
{"my-field":["A", "A","B"]}
{"index":{}}
{"my-field":["A", "B"]}
{"index":{}}
{"my-field":["B","A"]}
{"index":{}}
{"my-field":"A"}
{"index":{}}
{"my-field":["A"]}

GET /test_runtime/_search
{
  "query":{
    "bool":{
      "filter":[
        {"term":{"my-field":"A"}},
        {"term":{"my-field":"B"}},
        {"term":{"fingerprint":"A_B"}}
      ]
    }
  },
  "runtime_mappings": {
    "fingerprint": {
      "type": "keyword",
      "script": {
        "source": """
        if (doc['my-field'] instanceof List) {
          Set set = new HashSet(doc['my-field']);
          List list = new ArrayList(set);
          Collections.sort(list);
          String str = String.join('_',list);
          emit(str)
        } else {
          emit(doc['my-field'].value)
        }
        """
      }
    }
  },
  "fields": [
    "fingerprint"
  ]
}

Or you can add an ingest script processor to calculate the fingerprint at index time. This way has better performance when querying.

Using an ingest pipeline looks like this:

PUT _ingest/pipeline/test_fingerprint_pipeline/
{
  "description":"",
  "processors":[
    {
      "script":{
        "source": """
        ctx['a'] = 'a';
        if (ctx['my-field'] instanceof List) {
        Set set = new HashSet(ctx['my-field']);
        List list = new ArrayList(set);
        Collections.sort(list);
        String str = String.join('_',list);
        ctx['fingerprint'] = str
      } else {
        ctx['fingerprint'] = ctx['my-field']
      }"""
      }
    }
  ]
}
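
You can check what the pipeline produces before indexing anything with the simulate API (a quick sketch):

POST _ingest/pipeline/test_fingerprint_pipeline/_simulate
{
  "docs": [
    { "_source": { "my-field": ["B", "A"] } }
  ]
}

The returned doc should contain "fingerprint": "A_B".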

PUT /test_pipeline/
{
  "mappings": {
    "properties": {
      "my-field":{
        "type": "keyword"
      },
      "fingerprint":{
        "type": "keyword"
      }
    }
  },
  "settings":{
    "index.default_pipeline": "test_fingerprint_pipeline"
  }
}
GET test_pipeline

POST /test_pipeline/_bulk
{"index":{}}
{"my-field":["A", "B", "C"]}
{"index":{}}
{"my-field":["A", "A","B"]}
{"index":{}}
{"my-field":["A", "B"]}
{"index":{}}
{"my-field":["B","A"]}
{"index":{}}
{"my-field":"A"}
{"index":{}}
{"my-field":["A"]}

GET /test_pipeline/_search
{
  "track_total_hits": true, 
  "query":{
    "bool":{
      "filter":[
        {"term":{"fingerprint":"A_B"}}
      ]
    }
  }
}

Hi Tomo_M, thanks a lot for your samples. It's really interesting :wink: and it helps me understand Elasticsearch's capabilities. In my use case I need to re-index updated data every week (about 150k docs), so the ingest pipeline is interesting.

I will try all these ways. Thank you all


One more thing I forgot to add: since you are using keyword fields, you don't have to worry about performance, because keyword fields are not analyzed, unlike text fields. From this perspective you are safe.


In fact, in my real use case the mapping is like this:

"my-field": {
   "type": "text",
   "fields": {
      "completion": {
          "analyzer": "default",
          "type": "completion"
        },
        "keyword": {
           "type": "keyword"
      }
   }
}

I think that shouldn't change anything performance-wise as long as I use my-field.keyword?
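
For example, I would run the same kind of filter against the keyword sub-field (a sketch based on the mapping above):

GET my-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "my-field.keyword": "A" } },
        { "term": { "my-field.keyword": "B" } }
      ]
    }
  }
}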

Thanks.

Hi @Cholley_Alexandre,

I found another solution.

If you don't mind duplicates of 'A' or 'B', the query

GET /test_runtime/_search
{
  "query":{
    "bool": {
      "filter": [
        {"term": {"my-field": "A"}},
        {"term": {"my-field": "B"}}
      ],
      "must_not": [
        {
          "regexp": {
            "my-field": "~(A|B)"
          }
        }
      ]
    }
  }
}

will return what you want: the must_not regexp ~(A|B) matches any term other than A or B, so documents containing any additional value are excluded. If you want to distinguish duplicates or order, you may need a custom fingerprint strategy as before.


Hi @Tomo_M,

Yes, it's a perfect solution for my use case. I don't need to distinguish duplicates or order.

If I understand correctly, you must also use multiple "term" clauses in the query (one "term" per value) rather than a single "terms" query (with an array of values):

Works well with multiple "term" clauses:

GET /test_runtime/_search
{
  "query":{
    "bool": {
      "filter": [
        {"term": {"my-field": "A"}},
        {"term": {"my-field": "B"}}
      ],
      "must_not": [
        {
          "regexp": {
            "my-field": "~(A|B)"
          }
        }
      ]
    }
  }
}

Doesn't work with a single "terms" query:

GET /test_runtime/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "my-field": [
              "A",
              "B"
            ]
          }
        }
      ],
      "must_not": [
        {
          "regexp": {
            "my-field": "~(A|B)"
          }
        }
      ]
    }
  }
}

And "regexp" with "must_not" is use to make sure that all other terms are excluded.

Thanks :wink:


Yes, we have to use multiple term queries because the terms query functions as OR (a document matches if it contains any of the values), while separate term clauses in a filter are combined with AND.
If you are satisfied with my post, please mark it as the Solution. Thanks :grinning:
