Differences between 2 groups of documents by multiple criteria

Hello everyone,

Please take a look at this dummy documents:

[
    {
        "project": "a",
        "run": 1,
        "tool_name": "tool_1",
        "tool_output": "abc.com"
    },
    {
        "project": "a",
        "run": 1,
        "tool_name": "tool_1",
        "tool_output": "xyz.com"
    },
    {
        "project": "a",
        "run": 2,
        "tool_name": "tool_1",
        "tool_output": "abc.com"
    },
    {
        "project": "a",
        "run": 2,
        "tool_name": "tool_1",
        "tool_output": "xyz.com"
    },
    {
        "project": "a",
        "run": 2,
        "tool_name": "tool_1",
        "tool_output": "new.com"
    }
]

I need to find tool output differences between this 2 runs - new tool outputs appearing in run 2. I want to get result like this:

{
    "new_tool_outputs_count": 1,
    "new_tool_outputs": ["new.com"]
}

In SQL analogy it is something like outer join.

I probably need to use some aggregation query, but I am not sure is something like this possible with elasticsearch. Also, the number of documents for comparison can reach a scale of millions.

Top of mind I think you could solve this in 2 stages with transform. In the 1st stage you collapse (group_by) project/runs, so you have 1 document per project and run.

In the 2nd stage you compare the runs, we have an example in the docs for comparing indices.

Whether you use 1 transform overall or 2 depends on the number of projects and the number of runs you have per project: If its always only 2 runs, I think it can be done with 1 transform. In this case I would only group_by project.

I hope this gives you some idea, scale shouldn't be a problem.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.