Retrieve documents matching criteria along with missing documentIds

I have list of document ids. When I query Elasticsearch, In result I need document ids which are matching my criteria and documentIds for which documents do not exist in Elasticsearch.

Background on use case
I need to search for missing reports for sites. DocumentId of report is the siteId.
A report is missing if its reportState is NEW (not submitted) or if it doesn't exist in Elasticsearch.

Expected Request and Response

Request: all sites i.e documentIds in request to Elasticsearch
[ABC1, ABC2, ABC3]

Documents in Elasticsearch
[ABC1 with reportState NEW, ABC2 with reportState SUBMITTED, ABC3 document doesn't exist with this documentId]

Now I want to build query X such that my response will retrieve below documentIds.
[ABC1, ABC3]

{
    "query": {
        "bool": {
            "should": [
                {
                    "term": {
                        "reportState": "NEW"
                    }
                },
                {
                    //Build something like below specifying document doesn't exist
                    "must_not": {
                        "exists": {
                            "field": "reportState"
                        }
                    }
                }
            ]
        }
    }
}

Available Alternative
I found an approach using mget on Check documents not existing at elasticsearch - Stack Overflow but I require to paginate over results, hence looking for a query.

can you please write question with example...where we can also operate on kibana to find solution

A query can never return documents that do not exist, so I do not see how you can build a query to return what you describe. If you knew you expected these two documents you could write a query to return the first one and deduce the other one is not present in the client application.

I updated the question with sample query.

I cannot do that. With your logic

Request: [ABC1, ABC2, ABC3]

Elasticsearch Response: [ABC1]
By deducing other ones are not present [ABC2, ABC3], computed response becomes [ABC1, ABC2, ABC3]
Whereas expected response is [ABC1, ABC3].

The issue here is if I fire a query with just one criteria ReportState=NEW, I do not know whether ABC2 didn't match my query or whether it didn't even exist.

A query can not return a document that does not exist, so you need to find a way to handle this in your application logic or perhaps run multiple queries.

Ofcourse I can handle this in application logic but its going to be a heavy operation.

I am looking for an optimised approach to achieve this. Using mget is the only way I could find so far, but I need pagination as well over missing reports.

Can you perhaps initially create documents matching all reports you expect and let these have a special field that indicate that they do not yet exist. When you submit a report you overwrite this with the proper document. This way you might be able to adjust your search criteria to return these together with the matching ones.

We have DynamoDB as primary storage, we would require to build a complex mechanism to create documents by default for our use case. That would be last option for us. Looking to solve using Elasticsearch itself.

That is as far as I know not possible.

Hi,

Try wrapping must_not query with another bool query because must_not query works only with bool query

{
    "query": {
        "bool": {
            "should": [
                {
                    "term": {
                        "reportState": "NEW"
                    }
                },
                {
                    //Build something like below specifying document doesn't exist
                    "bool":[
                        "must_not": {
                              "exists": {
                                   "field": "reportState"
                                }
                         }
                     ]
                }
            ]
        }
    }
}

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.