I have an index with documents that only have 3 fields: id, timestamp and status.
I want to retrieve the newest document for each id, but if that document's status equals "something", I would like to ignore that bucket completely.
What's the best way to do this?
I was able to retrieve the newest document for each id with the query below but I don't know how to filter buckets based on the document's status.
Use a query other than "match_all". You can use a bool query with a must_not clause and put a "match" expression in there for the status you want to ignore.
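A minimal sketch of that suggestion, written as a Python dict for the search request body. The field name `status` and the value "something" come from the question; the variable names are only illustrative, and nothing here is run against a cluster.

```python
import json

# Status value to exclude; "something" is the placeholder used in the question.
IGNORED_STATUS = "something"

# bool query that matches every document except those with the ignored status.
query = {
    "query": {
        "bool": {
            "must_not": [
                {"match": {"status": IGNORED_STATUS}}
            ]
        }
    }
}

# Print the body as JSON so it can be pasted into a search request.
print(json.dumps(query, indent=2))
```

Note that this filters individual documents before any aggregation runs, so it does not remove whole buckets.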
If I filter out status A in the query, I'd get a bucket with the second event (status B).
What I really want is to filter out the bucket if the newest document (in this example, the first event) contains status A. So, the result of my query would be 0 buckets or a single empty bucket (with no hits).
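To pin down the intended behaviour, here is a plain-Python sketch of the semantics (not an Elasticsearch query). The sample events, the second id, and the function name `latest_per_id` are made up for illustration, and timestamps are assumed to be ISO 8601 strings in one format so that string comparison orders them correctly.

```python
from typing import Dict, List


def latest_per_id(docs: List[dict], ignored_status: str) -> Dict[str, dict]:
    """Keep the newest doc per id, then drop ids whose newest doc has the ignored status."""
    latest: Dict[str, dict] = {}
    for doc in docs:
        current = latest.get(doc["id"])
        # ISO 8601 strings in the same format compare chronologically.
        if current is None or doc["timestamp"] > current["timestamp"]:
            latest[doc["id"]] = doc
    # The status check happens only after the newest doc has been chosen.
    return {k: v for k, v in latest.items() if v["status"] != ignored_status}


# Hypothetical sample data: id "1" has a newest event with status A and an older one with status B.
events = [
    {"id": "1", "timestamp": "2020-05-25T10:00:00Z", "status": "A"},
    {"id": "1", "timestamp": "2020-05-24T10:00:00Z", "status": "B"},
    {"id": "2", "timestamp": "2020-05-25T09:00:00Z", "status": "B"},
]

result = latest_per_id(events, ignored_status="A")
print(sorted(result))  # id "1" is dropped entirely; the older status B event is not returned
```

The key point is the order of operations: pick the newest document first, filter on status second.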
One of the issues with trimming via bucket selectors is that after trimming you may find you have no buckets left at all and have to go back and ask for more data with more searches. It can be a workable solution, but it depends on the data, and the worst-case scenario is very inefficient.
Please have a look at this painless example (this should also work in older versions).
For filtering out complete buckets/documents, I suggest using a drop processor that runs after the pivot. This can be done with an ingest pipeline, which you can specify as part of the transform destination.
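A rough sketch of what those two pieces could look like, written as Python dicts for the request bodies. The pipeline id `drop-ignored-status`, the destination index name, and the field path `latest_doc.status` are assumptions: the actual path depends on how the pivot names the field that holds the latest document.

```python
import json

# Hypothetical names, chosen only for this example.
PIPELINE_ID = "drop-ignored-status"
DEST_INDEX = "latest-doc-by-id"

# Body for PUT _ingest/pipeline/<PIPELINE_ID>.
# The drop processor discards a document when its "if" condition is true.
# "latest_doc.status" is an assumed field path produced by the pivot.
pipeline_body = {
    "description": "Drop pivot results whose latest document has the ignored status",
    "processors": [
        {
            "drop": {
                "if": "ctx.latest_doc?.status == 'something'"
            }
        }
    ],
}

# "dest" section of the transform config, pointing at the pipeline above.
transform_dest = {
    "index": DEST_INDEX,
    "pipeline": PIPELINE_ID,
}

print(json.dumps(pipeline_body, indent=2))
print(json.dumps(transform_dest, indent=2))
```

Because the pipeline runs on the pivot output, the condition sees only the newest document per id, so a dropped document removes that id from the destination index entirely.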
I followed the painless example and I was able to transform the index into a "last document" index grouped by id.
As my source indexes are suffixed with the date the doc was indexed (sample indexes: docs-2020-05-25, docs-2020-04-24, ...), can I do the same for the index generated by the transform API?
I would like to generate indexes like: latest-doc-by-id-2020-05-25, latest-doc-by-id-2020-05-24, ...