I have the following setup: Filebeat -> Logstash -> Elasticsearch, with the Logstash output:

```
output {
  elasticsearch {
    index => "idx-%{+YYYY-MM-dd}"
    document_id => "%{myDocID}"
  }
}
```
where myDocID is created in the filter section by combining the filename and an itemNum field from the event. An alias called idx-all includes every idx-* index.
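For reference, the filter that builds myDocID might look roughly like this (a sketch: the exact source field names, e.g. `filename` and `itemNum`, depend on how the earlier filters parse the event):

```
filter {
  # Assumed: "filename" and "itemNum" were extracted by earlier filters
  # (e.g. grok/dissect). Combine them into the field used as document_id.
  mutate {
    add_field => { "myDocID" => "%{filename}.%{itemNum}" }
  }
}
```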
If Filebeat is watching a file foo which has 2 lines on Monday, containing:
```
001 Status Fail
002 Status Success
```
then ES creates 2 docs in the idx-2018-05-14 index, with ids foo.001 and foo.002.
I have an aggregation that counts the number of docs per status, so it runs against idx-all. The agg should return:
Success = 1
Fail = 1
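The counting agg I have is essentially a plain terms aggregation over the alias, something like this (a sketch: the field name `status.keyword` is an assumption about the mapping):

```json
POST idx-all/_search
{
  "size": 0,
  "aggs": {
    "status_counts": {
      "terms": { "field": "status.keyword" }
    }
  }
}
```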
On Tuesday, the foo file gets another line:
```
001 Status Success
```
and ES creates 1 doc in the Tuesday index idx-2018-05-15, with id foo.001.
We would now like the agg run against idx-all to return:
Success = 2
Fail = 0
However, since the 001 doc from Monday still exists and we're aggregating against the alias idx-all, I'll get 001 from both days and the count will be:
Success = 2
Fail = 1
I have about 10K records on average, so I'd prefer to do this in one ES query rather than pulling everything back into Java and processing it there. I have the counting agg query, and a separate query that uses aggs to try to remove the older duplicates, but each item (e.g. 001, 002) ends up in its own bucket. I don't know how to combine these into a single query, if that's even possible.
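The de-dup attempt is along these lines: a terms agg on the doc id with a top_hits sub-agg to keep only the newest copy (a sketch: the field names `myDocID.keyword` and `@timestamp` are assumptions). This does pick the latest version of each item, but, as described above, it leaves one bucket per item rather than a combined status count:

```json
POST idx-all/_search
{
  "size": 0,
  "aggs": {
    "by_doc": {
      "terms": { "field": "myDocID.keyword", "size": 10000 },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [ { "@timestamp": { "order": "desc" } } ]
          }
        }
      }
    }
  }
}
```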
What would the single query look like that would first de-dup across indices and then agg across that result set?