Comparison of data between two indices

Hello,

I'm looking for suggestions.
I have two indices, for example:

  1. index-a-2025
  2. index-b-2025

Both indices contain Beat host names on which they are deployed. I want to compare both indices to identify any missing host names (i.e., the difference) between them.

Assuming I know that there are 1,000 Beat host names (stored in the field beats_host_name) in index-a-2025 and 980 Beat host names (stored in the field beats_stats_host_name) in index-b-2025, what would be the best way to quickly retrieve the 20 host names that exist in index-a-2025 but not in index-b-2025?

Could this be achieved using a Watcher that’s manually executed and returns those 20 host names? Or perhaps through a visualization in a dashboard?
I would really appreciate any guidance or suggestions.

Hello @s.buksa

We can use ES|QL which can help to retrieve the host names using below query :

Details
Index-a : 

{
  "index-a-2025": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "beats_host_name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

Index b : 

{
  "index-b-2025": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
       "beats_stats_host_name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

ES|QL Query (logic count < 2 means it has single entry) : 

FROM index-*
| EVAL host = COALESCE(beats_host_name, beats_stats_host_name)
| STATS count = COUNT(host) BY host
| WHERE count < 2



With Watcher also it should be possible but this i believe will be more simple than the Watcher transform logic.

Thanks!!

Thank you for your suggestion @Tortoise

I quickly tried to implement it, but it seems that this will not be the solution in my case. To run ES|QL cross-cluster search a valid Enterprise license is required.

One of the indices (index-b-2025) contains necessary data from another cluster. Basically, in my environment I have two indices - one local, another one from another cluster.

1 Like

First of all, I wold not make the assumption that there are 20 differences. Trust, but verify !!
It might be there are 900 hosts in common, and then 100 in index--a that are not in index-b, and 80 in index=b that are not in index-a !!

@Tortoise

FROM index-*
| EVAL host = COALESCE(beats_host_name, beats_stats_host_name)
| STATS count = COUNT(host) BY host
| WHERE count < 2

This also makes assumptions, that there's only one doc per index per hostname.

e.g. consider what would be the desired output for this data:

POST discuss-test-1/_doc
{ "hostname1" : "mars" }
POST discuss-test-1/_doc
{ "hostname1" : "venus" }
POST discuss-test-1/_doc
{ "hostname1" : "neptune" }
POST discuss-test-1/_doc
{ "hostname1" : "mars" }
POST discuss-test-1/_doc
{ "hostname1" : "neptune" }

POST discuss-test-2/_doc
{ "hostname2" : "mars" }
POST discuss-test-2/_doc
{ "hostname2" : "earth" }
POST discuss-test-2/_doc
{ "hostname2" : "venus" }
POST discuss-test-2/_doc
{ "hostname2" : "jupiter" }
POST discuss-test-2/_doc
{ "hostname2" : "jupiter" }
POST discuss-test-2/_doc
{ "hostname2" : "earth" }

A curl/sort/uniq/comm command pipe is not hard to write to cover list the differences, and requires nothing above basic license.

# in index1 but not in index2
$ comm -13  <(escurl 'discuss-test-1/_search' | jq -r '.hits.hits[]._source.hostname1' | sort | uniq) <(escurl 'discuss-test-2/_search' | jq -r '.hits.hits[]._source.hostname2' | sort | uniq)
earth
jupiter

# in index2 but not in index1
$ comm -23  <(escurl 'discuss-test-1/_search' | jq -r '.hits.hits[]._source.hostname1' | sort | uniq) <(escurl 'discuss-test-2/_search' | jq -r '.hits.hits[]._source.hostname2' | sort | uniq)
neptune

# in both index1 and index2
$ comm -12  <(escurl 'discuss-test-1/_search' | jq -r '.hits.hits[]._source.hostname1' | sort | uniq) <(escurl 'discuss-test-2/_search' | jq -r '.hits.hits[]._source.hostname2' | sort | uniq)
mars
venus

Obviously query requires some work, e.g. higher limit if indices has loads of documents, but general concept is I hope clear

1 Like