Finding changes (diffs)

(Lior) #1

I have many records, each with many fields. Two of those fields are "special-field" (string) and timestamp (date).
Now let's say I have 1,000,000,000 of those, and I want to find only the documents where the special-field changed.
For example, after sorting all documents by timestamp, the value of the special-field in the first 100 documents is "A". In the next 500 documents (101-600) the value is "B", and in documents 601-1000 the value is "A" again. I want to aggregate/query and get only documents 101 and 601 (those whose value differs from the previous document). Is that possible?
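
To make the expected result concrete, here is a minimal Python sketch of the ordered walk on toy data. It is not an Elasticsearch query. The integer timestamps, the `id` field, and the `change_points` helper are assumptions made up for illustration.

```python
from operator import itemgetter


def change_points(docs):
    """Return docs whose special-field differs from the previous doc (by timestamp)."""
    ordered = sorted(docs, key=itemgetter("timestamp"))
    changes = []
    previous = None
    for doc in ordered:
        value = doc["special-field"]
        # The very first document has no predecessor, so it is never a "change".
        if previous is not None and value != previous:
            changes.append(doc)
        previous = value
    return changes


# Toy data mirroring the example: docs 1-100 are "A", 101-600 are "B", 601-1000 are "A".
# Assumption: the timestamp is just the document number here.
docs = [
    {"id": i, "timestamp": i, "special-field": "A" if i <= 100 or i > 600 else "B"}
    for i in range(1, 1001)
]

changed_ids = [d["id"] for d in change_points(docs)]
print(changed_ids)
```

This needs every document in one place, in timestamp order, which is easy for 1,000 in-memory records and is exactly the hard part at 1,000,000,000.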

(Nik Everett) #2

Maybe. I haven't thought your particular example through hard enough to know for sure. The reason it isn't always "yes" is that efficient solutions to problems like this involve walking all of the documents in a particular order. Elasticsearch's aggregations don't walk the documents in any particular order, they walk all the shards in parallel, and they reduce the results late. But you might still be able to get a reasonably efficient solution for your problem within those constraints. Essentially you write an aggregation (probably a min aggregation inside of a terms aggregation in your case) which will yield something that is almost what you want. And then you can use a pipeline aggregation (maybe serial_diff in your case) to get exactly what you want. Or close to it.
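
As a rough illustration of what "min inside terms" produces, here is a small Python simulation of per-shard work followed by a late reduce. It is a sketch, not Elasticsearch code: the three-way split into "shards", the integer timestamps, and the helper names `shard_partial` and `reduce_partials` are all assumptions for the example.

```python
def shard_partial(shard_docs):
    """Per-shard step: earliest timestamp seen for each special-field value."""
    partial = {}
    for doc in shard_docs:  # visiting order inside a shard does not matter
        value, ts = doc["special-field"], doc["timestamp"]
        if value not in partial or ts < partial[value]:
            partial[value] = ts
    return partial


def reduce_partials(partials):
    """Late reduce: merge the per-shard results by taking the minimum per value."""
    merged = {}
    for partial in partials:
        for value, ts in partial.items():
            merged[value] = min(ts, merged.get(value, ts))
    return merged


# Same toy data as the question: 1-100 "A", 101-600 "B", 601-1000 "A".
docs = [
    {"id": i, "timestamp": i, "special-field": "A" if i <= 100 or i > 600 else "B"}
    for i in range(1, 1001)
]

# Assumption: documents are spread round-robin over three hypothetical shards.
shards = [docs[i::3] for i in range(3)]

first_seen = reduce_partials(shard_partial(s) for s in shards)
print(first_seen)
```

On this data the result records when each value first appears, so the switch to "B" at 101 shows up, but the return to "A" at 601 does not. That gap is one way to read "almost what you want", and it is why some further step on top of the bucketed result would still be needed.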
