We are creating a new service that uses a MongoDB change stream and a queue to index the data from our collection into Elasticsearch in real time for any new changes (inserts, updates, deletes) happening in Mongo.
However, a change stream cannot be used to migrate the existing data. What would be the best way to do this with zero downtime and no data loss? Can we use Monstache or other connectors? We are using MongoDB version 6 and Elastic version 8. FYI, we have both on-prem and cloud deployments.
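The rough pattern we had in mind for the zero-downtime part is sketched below in Python with pymongo (the connection string, database/collection names and the two helper functions are placeholders, not our actual service): open the change stream before the snapshot copy starts, buffer its events while the existing data is copied, then drain the buffer and keep listening.

```python
import queue
import threading

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
coll = client["appdb"]["orders"]                   # assumed database / collection

events = queue.Queue()

def listen():
    # The change stream is opened before the snapshot copy starts, so no
    # insert/update/delete that happens during the copy is missed.
    with coll.watch(full_document="updateLookup") as stream:
        for change in stream:
            events.put(change)

threading.Thread(target=listen, daemon=True).start()

# 1) Snapshot: copy every existing document (e.g. with the Bulk API).
for doc in coll.find():
    index_existing_document(doc)   # hypothetical helper

# 2) Catch-up and live tail: apply the changes buffered during the snapshot,
#    then keep consuming the queue indefinitely.
while True:
    apply_change(events.get())     # hypothetical helper
```

As long as every write is keyed by the Mongo _id (an upsert or delete by id), the overlap between the snapshot and the buffered events is harmless, because replaying an event just overwrites the same document.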
This is what I read: 1. Mongo Connector does not have good support for ES 6+ and its repository is not maintained.
2. Known issues: as per the Elastic MongoDB connector reference documentation, there was a bug introduced in version 8.12.0 that requires SSL/TLS to be enabled for MongoDB Atlas URLs (mongodb+srv) to sync correctly.
How about using Monstache to migrate the existing data from MongoDB and change streams for real-time ingestion into Elasticsearch? Also, can we use the Bulk API to do this one-time data migration?
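For the one-time copy itself, this is roughly what I had in mind using the Bulk API via the official Python clients (the endpoint, credentials and index name below are placeholders):

```python
from elasticsearch import Elasticsearch, helpers
from pymongo import MongoClient

es = Elasticsearch("https://localhost:9200", api_key="...")          # assumed endpoint/auth
coll = MongoClient("mongodb://localhost:27017")["appdb"]["orders"]   # assumed collection

def actions():
    # Reuse the Mongo _id as the Elasticsearch _id so that later change stream
    # events (and any retried copy) simply overwrite the same document.
    # Nested fields that are not JSON-serializable would still need converting.
    for doc in coll.find():
        doc_id = str(doc.pop("_id"))
        yield {"_index": "tenant-acme", "_id": doc_id, "_source": doc}

# streaming_bulk groups the documents into _bulk requests under the hood.
for ok, item in helpers.streaming_bulk(es, actions(), chunk_size=1000):
    if not ok:
        print("failed:", item)
```

helpers.streaming_bulk takes care of splitting the stream of documents into _bulk requests, so we would not have to build the request bodies ourselves.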
Elastic as a company does not support Monstache; it's the Monstache developers who provide and maintain that software. As I haven't used Monstache before, I can't speak to the features or support it has.
Elasticsearch Connectors are developed at Elastic and are supported as part of the Elastic ecosystem.
We are planning to have 1 index per tenant; we have 300-odd tenants, with the default of 1 shard per index to start with. Regarding the Bulk API, I had a few doubts.
For the incremental changes, assuming the existing data has been migrated using Elasticsearch Connectors or Monstache, we are using change streams listening to each per-tenant DB. Now let's say I have 5 concurrent updates to different documents of the collection we are going to index into Elastic. Assume these 5 updates are for 1 tenant, and then we have some other inserts/updates for different tenants. How does the Bulk API actually work here?
I mean, since we want to isolate the data for each tenant, that's why we are keeping 1 index per tenant.
The Bulk API can be used for events that belong to 1 tenant, right? If we mix the change events for different tenants, how will the filtering of data per tenant happen?
Even for the same tenant, using the Bulk API would mean introducing some lag, right?
By lag I mean: if these operations were done as single API calls for create, update and delete, then ES would be called immediately as soon as each event was received. However, if we use the Bulk API, the caller service will have to wait for a certain time, or for a certain number of events to accumulate in that window, before calling the Bulk API, right?
It will certainly be more efficient to batch operations together in a bulk request than to perform individual ones. Doing some buffering of the updates and batching them into bulk requests makes sense.
You will need to find the balance for your workload, and experiment with the buffering sizes / time.
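As an illustration only, and not an official recipe, the buffering could look roughly like the sketch below (Python client; the endpoint, thresholds and tenant-to-index naming are assumptions). The key point for the per-tenant question above is that every bulk action names its own _index, so a single _bulk request can safely carry events for several tenants.

```python
import time

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://localhost:9200", api_key="...")  # assumed endpoint/auth
MAX_ACTIONS, MAX_WAIT_SECONDS = 500, 1.0                     # tune for your workload

buffer, last_flush = [], time.monotonic()

def on_change_event(tenant_id, change):
    """Called for every change stream event; the tenant decides the target index."""
    index = f"tenant-{tenant_id}"                            # assumed naming scheme
    doc_id = str(change["documentKey"]["_id"])
    if change["operationType"] == "delete":
        buffer.append({"_op_type": "delete", "_index": index, "_id": doc_id})
    else:  # insert / update / replace, with full_document="updateLookup" on the stream
        buffer.append({"_index": index, "_id": doc_id,
                       "_source": change["fullDocument"]})
    maybe_flush()

def maybe_flush(force=False):
    global buffer, last_flush
    due = len(buffer) >= MAX_ACTIONS or time.monotonic() - last_flush >= MAX_WAIT_SECONDS
    if buffer and (force or due):
        # One _bulk request can mix actions for many indices, because every
        # action names its own _index, so per-tenant isolation is preserved.
        helpers.bulk(es, buffer)
        buffer, last_flush = [], time.monotonic()
```

In a real service you would also flush on a timer, so the last events of a quiet tenant are not stuck in the buffer, and you would retry or dead-letter failed actions.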
Is it possible to create a data stream instead of a plain index for the ingestion connector? (That search-... index)
Or is it possible to set up a totally arbitrary name for the target index to ingest into?
I would say that data streams are append-only and connectors are designed to update the documents when there is a change - so I don't think that data streams would be a good match for connectors.
I think you should post this question separately in the discuss forum. It is related to the previous question, but it can also be answered independently, and it would benefit from input from connectors experts.
Please ask in Elastic Search using the tag connectors. Thanks!