I have list of document ids. When I query Elasticsearch, In result I need document ids which are matching my criteria and documentIds for which documents do not exist in Elasticsearch.
Background on use case
I need to search for missing reports for sites. DocumentId of report is the siteId.
A report is missing if its reportState is NEW (not submitted) or if it doesn't exist in Elasticsearch.
Expected Request and Response
Request: all sites i.e documentIds in request to Elasticsearch
[ABC1, ABC2, ABC3]
Documents in Elasticsearch
[ABC1 with reportState NEW, ABC2 with reportState SUBMITTED, ABC3 document doesn't exist with this documentId]
Now I want to build query X such that my response will retrieve below documentIds.
[ABC1, ABC3]
A query can never return documents that do not exist, so I do not see how you can build a query to return what you describe. If you knew you expected these two documents you could write a query to return the first one and deduce the other one is not present in the client application.
Elasticsearch Response: [ABC1]
By deducing other ones are not present [ABC2, ABC3], computed response becomes [ABC1, ABC2, ABC3]
Whereas expected response is [ABC1, ABC3].
The issue here is if I fire a query with just one criteria ReportState=NEW, I do not know whether ABC2 didn't match my query or whether it didn't even exist.
A query can not return a document that does not exist, so you need to find a way to handle this in your application logic or perhaps run multiple queries.
Ofcourse I can handle this in application logic but its going to be a heavy operation.
I am looking for an optimised approach to achieve this. Using mget is the only way I could find so far, but I need pagination as well over missing reports.
Can you perhaps initially create documents matching all reports you expect and let these have a special field that indicate that they do not yet exist. When you submit a report you overwrite this with the proper document. This way you might be able to adjust your search criteria to return these together with the matching ones.
We have DynamoDB as primary storage, we would require to build a complex mechanism to create documents by default for our use case. That would be last option for us. Looking to solve using Elasticsearch itself.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.