Migrating OpenSearch to Elasticsearch using Logstash is generating more docs on the destination index than there are in the source index

Hello,

I am in the process of migrating our OpenSearch cluster, which is running ES 7.9 as the backend version, to an Elasticsearch cluster I built in EC2 that is also running 7.9.3 for the time being. I have the sync running just fine, and I am indexing using the doc ID to prevent duplication, but the numbers keep growing on the destination index.
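
For context, the deduplication is just the elasticsearch output's document_id option, roughly like this (the host and credentials are placeholders, and I'm assuming here that the ID is carried in the input's docinfo metadata under [@metadata][doc]; pulling it from a field such as id works the same way):

output {
  elasticsearch {
    hosts       => ["https://new-es-cluster:9200"]
    index       => "contacts-00001"
    document_id => "%{[@metadata][doc][_id]}"   # reuse the source _id so re-sent docs update instead of duplicating
    user        => "elastic"
    password    => "${ES_PWD}"
  }
}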

For example, here are the source OS cluster index stats for contacts-00001:

green open contacts-00001 USm-qjRfTECW1-QKg-LYXQ 6 2 444875573 51669599 450.1gb 148.8gb

Here is the same index on the destination ES side with the sync still running:
green open contacts-00001 stjryAniSvCffEqFjSJXSg 1 1 445715309 62469925 334.3gb 168.4gb
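
(For reference, the columns in that _cat/indices output are health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size and pri.store.size, so the numbers I am comparing are the docs.count values. The same live totals can also be checked with the _count API on each cluster, e.g.:)

# run against both the source and the destination and compare the "count" values
GET contacts-00001/_count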

Running some cursory searches in Kibana on specific data points, I have also noticed that not all entries have been synced yet.

So my question is: is this normal? Is the destination always going to be larger than the source because of how Logstash handles the data? How do I tell when the sync is done? This is to be a production cutover, so I do not want to point my search traffic at the new cluster until I am 100% sure it is accurate and has all entries.

Thanks in advance!

OpenSearch/OpenDistro are AWS-run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance. See What is OpenSearch and the OpenSearch Dashboard? | Elastic for more details.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

You need to check the mapping. Make sure it's the same.

If you are doing a reindex from remote or a Logstash job, I'd recommend going directly to 8.14.0. You could benefit from a lot of improvements.
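
For reference, a reindex-from-remote call for this kind of copy looks roughly like this; the remote host, credentials and batch size below are placeholders, and the remote address also has to be listed in reindex.remote.whitelist in elasticsearch.yml on the destination cluster:

POST _reindex?wait_for_completion=false
{
  "source": {
    "remote": {
      "host": "https://old-opensearch-endpoint:443",
      "username": "myuser",
      "password": "mypassword"
    },
    "index": "contacts-00001",
    "size": 1000
  },
  "dest": {
    "index": "contacts-00001"
  }
}

Reindex keeps the source _id, and wait_for_completion=false returns a task you can poll with the task management API.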

Thanks for the info. Looking at the mappings, they are the same, just parsed slightly differently:

OpenSearch:

{
  "contacts-00001" : {
    "mappings" : {
      "properties" : {
        "accountId" : {
          "type" : "keyword"
        },
        "company" : {
          "type" : "keyword"
        },
        "email" : {
          "type" : "keyword"
        },
        "firstName" : {
          "type" : "keyword"
        },
        "id" : {
          "type" : "keyword"
        },
        "lastModified" : {
          "type" : "long"
        },
        "lastName" : {
          "type" : "keyword"
        },
        "listId" : {
          "type" : "keyword"
        },
        "recordId" : {
          "type" : "keyword"
        },
        "values" : {
          "type" : "text",
          "analyzer" : "ngram_analyzer"
        }
      }
    }
  }
}

My Elasticsearch cluster:


{
  "contacts-00001" : {
    "mappings" : {
      "properties" : {
        "@timestamp" : {
          "type" : "date"
        },
        "@version" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "accountId" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "company" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "email" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "firstName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "id" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "lastModified" : {
          "type" : "long"
        },
        "lastName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "listId" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "recordId" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "values" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

As for updating the cluster, that is on my to-do list, but our company has been running 7.9 for years and our on-prem cluster is still on 7.9.3 at this time. Once I get everything into the cloud I will update the stack, but that has to wait for now.

As for the process, I am just running a straight one-for-one sync with Logstash as the middleman. I can provide my .conf if needed; a stripped-down version of the pipeline is below.
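
The endpoints, credentials and scroll/page sizes here are placeholders rather than my actual values; the output side is the elasticsearch output with document_id => "%{[@metadata][doc][_id]}" shown in my first post:

input {
  elasticsearch {
    hosts          => ["https://old-opensearch-endpoint:443"]
    index          => "contacts-00001"
    query          => '{ "query": { "match_all": {} } }'
    docinfo        => true
    docinfo_target => "[@metadata][doc]"   # keep the original _index/_id available for the output to reuse
    size           => 1000
    scroll         => "5m"
  }
}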

Definitely not the same mappings at all.
You can't compare apples and oranges.

If you look at the mappings, the fields are exactly the same. The only additions are:

"@timestamp" : {
          "type" : "date"
        },
        "@version" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256

Could you elaborate on how two extra fields result in millions more documents? Again, I am trying to understand why these numbers are off.

I have no idea, and it's probably unrelated, but what I am sure of is that you should have the same mapping before trying to reindex your whole dataset.
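
Concretely, that means creating the destination index up front with the source mapping instead of letting dynamic mapping turn everything into text plus keyword multi-fields. Roughly like this (the ngram_analyzer body below is only a stand-in; the real definition has to be copied from the source index settings, e.g. GET contacts-00001/_settings):

PUT contacts-00001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "accountId":    { "type": "keyword" },
      "company":      { "type": "keyword" },
      "email":        { "type": "keyword" },
      "firstName":    { "type": "keyword" },
      "id":           { "type": "keyword" },
      "lastModified": { "type": "long" },
      "lastName":     { "type": "keyword" },
      "listId":       { "type": "keyword" },
      "recordId":     { "type": "keyword" },
      "values":       { "type": "text", "analyzer": "ngram_analyzer" }
    }
  }
}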

Are you using the same _id for the documents in both cases?

Can you share your .conf?

If you are using a custom _id, I do not see how you could get more documents than you have in the source index. You could get fewer if you had any mapping issues, but that does not seem to be the case.
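
One way to double-check that is to pull a few _ids from the source and confirm the destination resolves the exact same IDs, for example (the ID here is just a placeholder):

# on the source: grab a few document IDs
GET contacts-00001/_search?size=5&_source=false

# on the destination: each of those IDs should resolve to the same document
GET contacts-00001/_doc/<one-of-those-ids>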