Best way to migrate data from MongoDB to Elasticsearch

Hi All,

We are creating a new service that uses a MongoDB change stream and a queue to index data from our collection into Elasticsearch in real time for any new changes (inserts, updates, deletes) happening in MongoDB.
However, a change stream cannot be used to migrate existing data. What would be the best way to do that with zero downtime and no data loss? Can we use Monstache or other connectors? We are using MongoDB version 6 and Elasticsearch version 8. FYI, we have both on-prem and cloud deployments.

Thanks,
Moni

Hi @Moni_Hazarika !

I believe you could use the Elasticsearch MongoDB connector to migrate your data. There are both Cloud and self-managed versions of it.


Thanks for the reply.

This is what I read: 1. Mongo connector does not have good support for ES 6+ and its repository is not maintained.
Known issues: As per the Elastic MongoDB connector reference documentation, there was a bug introduced in version 8.12.0 that required SSL/TLS to be enabled for MongoDB Atlas URLs (mongo+srv) to sync correctly.

How about Monstache to migrate existing data from MongoDB and use change streams for real-time ingestion into Elasticsearch? Also can we use the bulk api to do this one time data migration?

The Elasticsearch MongoDB connector is currently supported for 8.x versions and beyond, and it is maintained. Are you referring to another connector?

there was a bug introduced in version 8.12.0 that required SSL/TLS to be enabled for MongoDB Atlas URLs (mongo+srv) to sync correctly.

If enabling SSL for Atlas is a problem, this bug has still not been corrected.

How about Monstache to migrate existing data from MongoDB and use change streams for real-time ingestion into Elasticsearch?

Elastic does not support Monstache - you can give it a try and let us know your experience.

Using change streams sounds like a good approach for real-time ingestion.

Also can we use the bulk api to do this one time data migration?

Sure, the Bulk API should be the way to go for batch ingestion into Elasticsearch, since it optimizes the ingestion process.
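As a rough illustration of what batching for a one-time migration could look like, here is a minimal sketch that turns an iterable of documents into request bodies for POST _bulk. The function name `bulk_bodies`, the batch size of 500, and the index name are assumptions made for the example, and it assumes the documents are already JSON-serializable (values such as ObjectId or dates would need converting first).

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


def bulk_bodies(docs: Iterable[Dict[str, Any]], index: str,
                batch_size: int = 500) -> Iterator[str]:
    """Yield one NDJSON body for POST _bulk per batch of documents.

    Hypothetical helper: `docs` stands in for whatever cursor reads the
    existing MongoDB collection.
    """
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        lines = []
        for doc in batch:
            source = dict(doc)
            # _id is a metadata field in Elasticsearch, so it goes in the
            # action line and is removed from the document source.
            doc_id = str(source.pop("_id"))
            lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
            lines.append(json.dumps(source))
        # A bulk request body must end with a newline.
        yield "\n".join(lines) + "\n"
```

Because each action reuses the MongoDB _id as the Elasticsearch _id, sending the same document twice overwrites it instead of creating a duplicate, which helps when the migration is re-run or overlaps with change stream events.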

Sorry, can you elaborate more on this? What exactly do you mean by "Elastic does not support Monstache"?

Elastic as a company does not support Monstache; it is the Monstache developers who provide and maintain that software. As I haven't used Monstache before, I can't comment on its features or support.

Elasticsearch Connectors are developed by Elastic and are supported as part of the Elastic ecosystem.


okay thanks for the explanation.

We are planning to have 1 index per tenant, and we have around 300 tenants, with the default of 1 shard per index to start with. I have a few doubts regarding the Bulk API.
For the incremental changes, assume the existing data has already been migrated using Elasticsearch Connectors or Monstache, and that we are using change streams listening on each tenant's DB. Now let's say there are 5 concurrent updates to different documents of the collection we are indexing into Elasticsearch. Assume these 5 updates are for 1 tenant, and there are also some other inserts/updates for different tenants. How does the Bulk API actually work?
I ask because we want to isolate the data for each tenant, which is why we are keeping 1 index per tenant.

  1. The Bulk API can be used for events that belong to 1 tenant, right? If we mix the change events for different tenants, how will the data be filtered per tenant?
  2. Even for the same tenant, using the Bulk API would mean introducing some lag, right?

The Bulk API lets you perform operations on different indices in a single request:

POST _bulk
{ "index" : { "_index" : "test1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test2", "_id" : "2" } }
{ "create" : { "_index" : "test3", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test4"} }
{ "doc" : {"field2" : "value2"} }

Each doc line is preceded by a metadata line that indicates the action and the target index.
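To make the per-tenant routing concrete, here is a minimal sketch that maps MongoDB change events from several tenants into one bulk body, with each action line naming its own target index. The `tenant-<db>` naming rule and the helper names are hypothetical, and the sketch assumes the change stream is opened so that update events carry the full document (the updateLookup option); otherwise updates would need to be handled through the update action instead.

```python
import json
from typing import Any, Dict, Iterable, List


def index_for(event: Dict[str, Any]) -> str:
    # Hypothetical naming rule: one index per tenant database.
    # Elasticsearch index names must be lowercase.
    return "tenant-" + event["ns"]["db"].lower()


def to_bulk_lines(event: Dict[str, Any]) -> List[str]:
    """Translate one change event into the NDJSON lines of a bulk action."""
    op = event["operationType"]
    meta = {"_index": index_for(event), "_id": str(event["documentKey"]["_id"])}
    if op == "delete":
        # Delete actions have a metadata line only, no document line.
        return [json.dumps({"delete": meta})]
    if op in ("insert", "replace", "update"):
        # Assumes fullDocument is present on updates (updateLookup).
        source = {k: v for k, v in event["fullDocument"].items() if k != "_id"}
        return [json.dumps({"index": meta}), json.dumps(source)]
    return []  # other event types (drop, rename, ...) are ignored in this sketch


def build_bulk_body(events: Iterable[Dict[str, Any]]) -> str:
    """Join the actions for many events, possibly from many tenants."""
    lines = [line for event in events for line in to_bulk_lines(event)]
    return "\n".join(lines) + "\n" if lines else ""
```

With this shape, events from different tenants can share one request without their data mixing, because the index is chosen per action rather than per request.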

If you mean the lag between when a document is indexed and when it becomes available for search, then yes, there is one. There's a tradeoff between tuning for indexing speed and tuning for search speed.

You will need to check what your refresh interval should be, and the overall indexing strategy, depending on your data size and needs.

The Bulk API is the most effective way of dealing with bulk changes and the way to go for your integration with change streams.

It would also be interesting to check this post with a discussion of multi tenancy implementation in Elasticsearch.

The lag I meant here is this: if these operations were done as single API calls for create, update, and delete, then ES would be called as soon as each event was received. However, if we use the Bulk API, the caller service will have to wait for a certain time, or for a certain number of events to accumulate in that window, before calling the Bulk API, right?

Thanks for clarifying @Moni_Hazarika !

It certainly will be more efficient to batch operations together in a bulk request than performing individual ones. Doing some buffering of the updates and batching them in bulk requests makes sense.

You will need to find the balance for your workload, and experiment with the buffering sizes / time.
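One possible shape for that buffering is a small class that flushes when either a size limit or a time limit is reached, whichever comes first. This is only a sketch: `BulkBuffer`, its parameters, and the default limits are hypothetical, and the `flush` callback stands in for whatever code actually sends the bulk request.

```python
import time
from typing import Any, Callable, List, Optional


class BulkBuffer:
    """Collect actions and flush on a size limit or a time limit."""

    def __init__(self, flush: Callable[[List[Any]], None],
                 max_actions: int = 500, max_wait_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush            # called with the list of buffered actions
        self._max_actions = max_actions
        self._max_wait = max_wait_seconds
        self._clock = clock            # injectable so the timing can be tested
        self._pending: List[Any] = []
        self._first_at: Optional[float] = None

    def add(self, action: Any) -> None:
        if not self._pending:
            # Remember when the oldest buffered action arrived.
            self._first_at = self._clock()
        self._pending.append(action)
        if len(self._pending) >= self._max_actions:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes if the oldest action waited too long."""
        if self._pending and self._clock() - self._first_at >= self._max_wait:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._first_at = None
        self._flush(batch)
```

The time limit caps the extra latency an event can pick up while waiting for a batch to fill, so a quiet tenant still gets indexed within roughly max_wait_seconds as long as tick is called regularly.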