Migrating OpenSearch to Elasticsearch using Logstash is generating more docs on the destination index than there are in the source index

Mat_Wojdyla · June 11, 2024, 3:13pm

Hello,

I am in the process of migrating our OpenSearch cluster, which runs ES 7.9 as the backend version, to an Elasticsearch cluster I built in EC2 that is also running 7.9.3 for the time being. The sync is running fine, and I am indexing using the doc ID to prevent duplication, but the document count on the destination index keeps growing.

For example, here are the source OS cluster index stats for contacts-00001:

green open contacts-00001 USm-qjRfTECW1-QKg-LYXQ 6 2 444875573 51669599 450.1gb 148.8gb

Here is the same index on the destination ES side with the sync still running:
green open contacts-00001 stjryAniSvCffEqFjSJXSg 1 1 445715309 62469925 334.3gb 168.4gb
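
For reading the two rows above: they follow the default `_cat/indices` column order (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size). A minimal Python sketch that parses both rows and computes the gap, assuming that default column order; the helper name `parse_cat_row` is illustrative only and not part of any Elasticsearch client:

```python
from typing import NamedTuple


class IndexStats(NamedTuple):
    index: str
    primaries: int
    replicas: int
    docs_count: int
    docs_deleted: int


def parse_cat_row(row: str) -> IndexStats:
    """Parse one row of `_cat/indices` output (assumes the default column order)."""
    cols = row.split()
    # cols: health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
    return IndexStats(cols[2], int(cols[4]), int(cols[5]), int(cols[6]), int(cols[7]))


source = parse_cat_row(
    "green open contacts-00001 USm-qjRfTECW1-QKg-LYXQ 6 2 444875573 51669599 450.1gb 148.8gb"
)
dest = parse_cat_row(
    "green open contacts-00001 stjryAniSvCffEqFjSJXSg 1 1 445715309 62469925 334.3gb 168.4gb"
)

extra_docs = dest.docs_count - source.docs_count
extra_deleted = dest.docs_deleted - source.docs_deleted
print(extra_docs)     # 839736
print(extra_deleted)  # 10800326
```

So the destination is ahead by 839,736 live documents at the time of the snapshot. The higher docs.deleted figure on its own is not evidence of duplicates: overwriting a document by `_id` marks the previous version as deleted until segments are merged.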

Running some cursory searches in Kibana on specific data points, I have also noticed that not all data entries have been synced yet.

So my question is, is this normal? Is the destination going to be larger than the source no matter what due to Logstash's handling of the data? How do I tell when the sync is done? This is to be a production cutover so I do not want to point my search to the new instance until I am 100% sure it's accurate and has all entries.

Thanks in advance!

system · June 11, 2024, 3:13pm

OpenSearch/OpenDistro are AWS run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance. See What is OpenSearch and the OpenSearch Dashboard? | Elastic for more details.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns )

dadoonet · June 11, 2024, 6:24pm

You need to check the mapping. Make sure it's the same.

If you are doing a reindex from remote or a Logstash job, I'd recommend going directly to 8.14.0. You could benefit from a lot of improvements.

Mat_Wojdyla · June 11, 2024, 7:17pm

Thanks for the info. Looking at the mappings, they are the same, just parsed slightly differently:

OpenSearch:

{
  "contacts-00001" : {
    "mappings" : {
      "properties" : {
        "accountId" : {
          "type" : "keyword"
        },
        "company" : {
          "type" : "keyword"
        },
        "email" : {
          "type" : "keyword"
        },
        "firstName" : {
          "type" : "keyword"
        },
        "id" : {
          "type" : "keyword"
        },
        "lastModified" : {
          "type" : "long"
        },
        "lastName" : {
          "type" : "keyword"
        },
        "listId" : {
          "type" : "keyword"
        },
        "recordId" : {
          "type" : "keyword"
        },
        "values" : {
          "type" : "text",
          "analyzer" : "ngram_analyzer"
        }
      }
    }
  }
}

My elastic cluster:


{
  "contacts-00001" : {
    "mappings" : {
      "properties" : {
        "@timestamp" : {
          "type" : "date"
        },
        "@version" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "accountId" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "company" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "email" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "firstName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "id" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "lastModified" : {
          "type" : "long"
        },
        "lastName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "listId" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "recordId" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "values" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

As for updating the cluster, that is on my to-do list, but our company has been running 7.9 for years and our on-prem cluster is still on 7.9.3 at this time. Once I get everything in the cloud I will update the stack, but that has to wait for the time being.

As for the process, I am just running a straight one-for-one sync with Logstash as the middleman. I can provide my .conf if needed.

dadoonet · June 11, 2024, 9:19pm

Definitely not the same mappings at all.
You can't compare oranges and apples.
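
To make the difference concrete: in the two mappings posted above, most fields are `keyword` on the source but `text` with a `keyword` sub-field on the destination (which looks like the default dynamic mapping for strings), and `values` has no `ngram_analyzer` on the destination. A minimal Python sketch that reports such differences, using an abbreviated subset of the posted mappings; the function names are illustrative, and only the top-level `type` and `analyzer` of each field are compared:

```python
def field_types(mapping: dict, index: str) -> dict:
    """Map each top-level field name to its (type, analyzer) pair."""
    props = mapping[index]["mappings"]["properties"]
    return {name: (spec.get("type"), spec.get("analyzer")) for name, spec in props.items()}


def mapping_diff(source: dict, dest: dict, index: str) -> dict:
    """Return {field: (source_spec, dest_spec)} for every field that differs."""
    src, dst = field_types(source, index), field_types(dest, index)
    return {
        name: (src.get(name), dst.get(name))
        for name in sorted(set(src) | set(dst))
        if src.get(name) != dst.get(name)
    }


# Abbreviated subset of the two mappings posted in this thread.
multi = {"keyword": {"type": "keyword", "ignore_above": 256}}
source_mapping = {"contacts-00001": {"mappings": {"properties": {
    "accountId": {"type": "keyword"},
    "lastModified": {"type": "long"},
    "values": {"type": "text", "analyzer": "ngram_analyzer"},
}}}}
dest_mapping = {"contacts-00001": {"mappings": {"properties": {
    "@timestamp": {"type": "date"},
    "accountId": {"type": "text", "fields": multi},
    "lastModified": {"type": "long"},
    "values": {"type": "text", "fields": multi},
}}}}

diff = mapping_diff(source_mapping, dest_mapping, "contacts-00001")
for field, (src_spec, dst_spec) in diff.items():
    print(field, src_spec, dst_spec)
```

In this subset only `lastModified` matches; `accountId` changes type, `values` loses its analyzer, and `@timestamp` exists only on the destination.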

Mat_Wojdyla · June 12, 2024, 1:41pm

If you look at the mappings the fields are exactly the same. The only additions are:

"@timestamp" : {
          "type" : "date"
        },
        "@version" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256

Could you elaborate on how two extra fields result in millions more documents? Again, I am trying to understand why these numbers are off.

dadoonet · June 12, 2024, 8:52pm

I have no idea, and it's probably unrelated, but what I am sure of is that you should have the same mapping before trying to reindex your whole dataset.

Are you using the same _id for the documents in both cases?

leandrojmp · June 12, 2024, 11:32pm

Can you share your .conf ?

If you are using a custom _id, I do not see how you could get more documents than you have in the source index. You could get fewer if you had any mapping issues, but that doesn't seem to be the case.
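
The reasoning about a custom _id can be illustrated with a toy in-memory model. This is a sketch only, not Elasticsearch or Logstash code: the `index_docs` helper, the sample documents, and the choice of the `id` field as the document ID are all hypothetical. Writing under a fixed ID overwrites on replay, while generated IDs add a new entry every time:

```python
import uuid


def index_docs(store: dict, docs: list, use_source_id: bool) -> None:
    """Write docs into a dict standing in for an index, keyed by document ID."""
    for doc in docs:
        # With a fixed ID a replay overwrites; with a generated ID it adds a new entry.
        doc_id = doc["id"] if use_source_id else uuid.uuid4().hex
        store[doc_id] = doc


docs = [
    {"id": "a1", "email": "x@example.com"},
    {"id": "b2", "email": "y@example.com"},
]

with_ids: dict = {}
index_docs(with_ids, docs, use_source_id=True)
index_docs(with_ids, docs, use_source_id=True)   # replay of the same batch

auto_ids: dict = {}
index_docs(auto_ids, docs, use_source_id=False)
index_docs(auto_ids, docs, use_source_id=False)  # replay of the same batch

print(len(with_ids), len(auto_ids))  # 2 4
```

Under this model, a destination count above the source count suggests either that some writes are not using the source ID, or that the source count was taken at a different moment than the destination count.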
