Our C#/.NET application uses MongoDB as its primary data store, relied on for ACID compliance. We are in the process of moving/mirroring some of the information from our key collections to Elasticsearch: the transactional data will remain in MongoDB, and Elasticsearch will be leveraged for its full-text search capability.
There are multiple ways to do the data ingestion into Elasticsearch, such as:
1. Use the elasticsearch-river-mongodb plugin to sync data from MongoDB to Elasticsearch (possibly obsolete)
2. Build our own CDC pipeline with MongoDB as the source and Elasticsearch as the sink
3. Use MongoDB change streams to push changes in near real time through a message broker: change streams → RabbitMQ → Elasticsearch
(our product runs both in the cloud and on-prem, so an AWS-only pipeline such as EventBridge → Amazon SQS → AWS Lambda may not be feasible; we are inclined towards the RabbitMQ brokers we already run in both environments)
4. Logstash
5. Other options such as mongo-connector or Monstache.
We are thinking of doing it asynchronously using MongoDB change streams to get a near-real-time experience with low replication lag. Do you see any issues or things we should be aware of? Are there any better options available?
On replication we have this SLA: we should build for the smallest replication lag possible and set 100 ms as the acceptable data replication lag for propagating changes to Elasticsearch. Also, FYI: we only replicate the latest version of the collection data, not past revisions.
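For context, here is a rough sketch of the producer side of option 3 from our POC. The database, collection, and queue names are placeholders, and it assumes the official MongoDB.Driver and RabbitMQ.Client NuGet packages (change streams also require MongoDB to run as a replica set):

```csharp
using System;
using System.Text;
using MongoDB.Bson;
using MongoDB.Driver;
using RabbitMQ.Client;

class ChangeStreamPublisher
{
    static void Main()
    {
        // Placeholder connection strings and names.
        var mongo = new MongoClient("mongodb://localhost:27017");
        var orders = mongo.GetDatabase("appdb").GetCollection<BsonDocument>("orders");

        var factory = new ConnectionFactory { HostName = "localhost" };
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();
        channel.QueueDeclare("mongo-cdc", durable: true, exclusive: false, autoDelete: false);

        // UpdateLookup makes update events carry the full current document,
        // which matches our "latest version only" replication model.
        var options = new ChangeStreamOptions
        {
            FullDocument = ChangeStreamFullDocumentOption.UpdateLookup
        };

        using var cursor = orders.Watch(options);
        foreach (var change in cursor.ToEnumerable())
        {
            // Forward a minimal event: operation type, _id, the document
            // (null on delete), and the resume token. Persisting the token
            // would let a restarted publisher continue where it left off
            // via ChangeStreamOptions.ResumeAfter.
            var evt = new BsonDocument
            {
                { "op", change.OperationType.ToString() },
                { "id", change.DocumentKey["_id"] },
                { "doc", change.FullDocument ?? (BsonValue)BsonNull.Value },
                { "resumeToken", change.ResumeToken }
            };
            channel.BasicPublish(exchange: "", routingKey: "mongo-cdc",
                                 basicProperties: null,
                                 body: Encoding.UTF8.GetBytes(evt.ToJson()));
        }
    }
}
```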
Rivers have not been a thing in Elasticsearch for probably 10 years, so I am not sure where you got this from.
I do not think Logstash is an option here, so I would probably recommend option 2 or 3.
In addition to the replication lag, which is just the time it takes to get the changes indexed into Elasticsearch, it is important to know that data is not immediately searchable in Elasticsearch once it has been indexed. By default it can take up to 1 second for a refresh to happen, which is what makes the documents searchable. This interval can be reduced, but doing so can have a potentially serious negative impact on performance.
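For illustration, the refresh interval can be tuned per index through the _settings endpoint; a minimal sketch using the raw REST API (the index name and the 200ms value are just examples):

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class RefreshIntervalExample
{
    static async Task Main()
    {
        using var http = new HttpClient { BaseAddress = new Uri("http://localhost:9200") };

        // Lower the refresh interval from the default 1s to 200ms for one index.
        // Caution: more frequent refreshes create more, smaller segments and
        // can seriously hurt indexing throughput, so benchmark before changing.
        var body = new StringContent(
            "{ \"index\": { \"refresh_interval\": \"200ms\" } }",
            Encoding.UTF8, "application/json");
        var response = await http.PutAsync("/my-index/_settings", body);
        response.EnsureSuccessStatusCode();
    }
}
```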
Thanks @Christian_Dahlqvist for the reply. It makes sense.
I am doing a POC on using MongoDB change streams for async replication.
As part of this change data capture we need to support inserts, updates, and deletes. We need to be able to delete a record from Elasticsearch if it is deleted in MongoDB. Is there anything specific you want to call out w.r.t. deletes? I read that docs in Elasticsearch are immutable and hence cannot be deleted or modified.
Every segment on disk has a .del file associated with it; on a delete request, the document is not physically removed but is marked as deleted in the .del file, and it no longer shows up in subsequent search requests. Is that correct?
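For reference, this is roughly how the consumer side of my POC handles the events; the index name is a placeholder, and it assumes the Elastic.Clients.Elasticsearch 8.x client with events shaped as in the publisher sketch above:

```csharp
using System;
using System.Threading.Tasks;
using Elastic.Clients.Elasticsearch;
using MongoDB.Bson;

class CdcConsumer
{
    private readonly ElasticsearchClient _client = new ElasticsearchClient(
        new ElasticsearchClientSettings(new Uri("http://localhost:9200")));

    // Handles one CDC event from the queue; "op", "id" and "doc" follow the
    // event shape produced by the change stream publisher sketched earlier.
    public async Task HandleAsync(BsonDocument evt)
    {
        var id = evt["id"].ToString();
        switch (evt["op"].AsString)
        {
            case "Insert":
            case "Update":
            case "Replace":
                // Index the latest version, keyed by the Mongo _id so the same
                // logical record always maps to the same Elasticsearch document.
                await _client.IndexAsync(evt["doc"].AsBsonDocument.ToDictionary(),
                    i => i.Index("orders-search").Id(id));
                break;
            case "Delete":
                // Delete by _id; Elasticsearch marks the doc as deleted and it
                // drops out of results after the next refresh.
                await _client.DeleteAsync(new DeleteRequest("orders-search", id));
                break;
        }
    }
}
```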
It is segments that are immutable, not documents. You can therefore delete and update documents. One special case is data streams, which are specifically designed to work with immutable data and impose some restrictions.
Deleted documents will be removed from search results when the next refresh happens, so this is not real-time. This is also when updates and new documents become searchable. Up until that point the documents only live in the transaction log, which is not searchable.
Thanks @Christian_Dahlqvist so much for your insights. One follow-up question.
When using MongoDB change streams to asynchronously ingest data into Elasticsearch for full-text search, are there any potential issues to be aware of? We are using MongoDB 6 and Elasticsearch 8. FYI, we have both on-prem and cloud deployments. For the message queue, since we already have RabbitMQ running in both environments, we are trying to leverage it rather than introducing Kafka.
Also, since MongoDB and Elasticsearch have different data models, do you have any suggestions/recommendations for defining a schema mapping strategy to transform MongoDB documents into Elasticsearch documents?
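To make the question concrete, we are leaning towards an explicit per-collection transform into a flat search document rather than indexing the raw Mongo documents; a sketch with illustrative field names:

```csharp
using System;
using MongoDB.Bson;

// Illustrative target shape for the search index: flat, containing only the
// fields we actually search on, with Mongo-specific types converted.
public record OrderSearchDoc(string Id, string CustomerName, string Notes, DateTime CreatedAt);

public static class OrderMapper
{
    // Explicitly project the MongoDB document instead of indexing raw BSON:
    // ObjectId becomes a string, nested fields are flattened, optional fields
    // get defaults, and fields we never search on are dropped.
    public static OrderSearchDoc ToSearchDoc(BsonDocument doc) => new(
        Id: doc["_id"].ToString()!,
        CustomerName: doc["customer"]["name"].AsString,
        Notes: doc.GetValue("notes", "").AsString,
        CreatedAt: doc["createdAt"].ToUniversalTime());
}
```

We would pair this with an explicit index mapping (text vs keyword fields, date formats) instead of relying on dynamic mapping.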