I cannot predict exactly what problems you will face, but I suspect they will arise where your theoretical model meets real-world problems and software limitations. One example is cross-cluster search (CCS), which was never designed to work at this scale. If you find issues around this in Elasticsearch, I suspect fixing them might not be considered high priority, given that the feature scales to the level that was intended and this is a very unusual edge case. You may therefore end up having to query all clusters in parallel from your application, and if you go down that route I suspect you could face issues with performance and network reliability, as it is a very large distributed system.
The algorithm you describe seems to be a standard scatter-gather, and there is nothing wrong with that. Getting something like this to work at scale across a large distributed system with good performance is, however, quite challenging in practice.
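To make the scatter-gather idea concrete, here is a minimal Python sketch of querying many clusters in parallel from the application and merging the results. It is only an illustration under assumptions: `scatter_gather`, `query_fn`, and the `(score, doc)` hit format are hypothetical names of mine, not part of any Elasticsearch client, and a real `query_fn` would need its own request timeout.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import heapq


def scatter_gather(clusters, query_fn, query, top_k=10):
    """Send `query` to every cluster in parallel and merge the best hits.

    clusters: dict of cluster name -> connection handle (hypothetical).
    query_fn: callable(handle, query) -> list of (score, doc) tuples.
              Assumed to enforce its own network timeout.
    Returns (merged_hits, failed_cluster_names).
    """
    hits, failed = [], []
    workers = max(1, min(32, len(clusters)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Scatter: one task per cluster.
        futures = {
            pool.submit(query_fn, handle, query): name
            for name, handle in clusters.items()
        }
        # Gather: collect results as they arrive, tolerating failures.
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                hits.extend((score, name, doc) for score, doc in fut.result())
            except Exception:
                # An unreachable cluster must not fail the whole query.
                failed.append(name)
    # Merge: keep only the globally best-scoring hits.
    merged = heapq.nlargest(top_k, hits, key=lambda h: h[0])
    return merged, failed
```

Even in this toy form the hard parts are visible: the slowest cluster determines the response time, and the caller has to decide what a partial result means when some clusters are listed in `failed`.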
I expect you would save yourself a lot of potential trouble by going with a more standard approach, even if that requires a bit more hardware.
Yes, I agree with you, this could be challenging. So, to summarize: I create an ES instance on each node for local querying and at the same time send all the data to the central cluster (or several clusters). The user then queries the central cluster(s) only. Sounds good.
Is there a standard way to sync local nodes with the central one, so that the data stays consistent across network disconnections? In my environment the disconnections can be long, even several days. During that time the GUI just shows that the particular node is not available. The problem is that when the node reconnects, it somehow has to sync the missing updates. I can buffer the updates on the node for a short time, say hours, to handle spontaneous disconnections, but not for days. From David's answer above I understand that defining a replica for each node in the central location is not a good solution. What would be the right approach then?
I would recommend writing to the local node and the central cluster in parallel, but if you have lengthy disconnections this means you need to queue the data locally until the connection is re-established. This is something that persistent queues and multiple pipelines in Logstash can achieve, but there may also be other ways.
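If you end up handling this in your own application instead, the idea is a durable store-and-forward buffer. Below is a minimal Python sketch using SQLite as the on-disk queue. The `LocalBuffer` class and the `send_fn` callback are hypothetical names for illustration; `send_fn` stands in for whatever bulk-indexes a batch into the central cluster and raises an exception while the link is down.

```python
import json
import sqlite3


class LocalBuffer:
    """Durable FIFO of documents waiting to be sent to the central cluster."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, doc TEXT NOT NULL)"
        )
        self.db.commit()

    def enqueue(self, doc):
        # Persist the document before acknowledging it to the caller.
        with self.db:
            self.db.execute(
                "INSERT INTO pending (doc) VALUES (?)", (json.dumps(doc),)
            )

    def flush(self, send_fn, batch_size=500):
        """Forward pending documents in order; stop at the first failure.

        Returns the number of documents successfully forwarded.
        """
        sent = 0
        while True:
            rows = self.db.execute(
                "SELECT id, doc FROM pending ORDER BY id LIMIT ?",
                (batch_size,),
            ).fetchall()
            if not rows:
                return sent
            try:
                send_fn([json.loads(doc) for _, doc in rows])
            except Exception:
                # Still disconnected: keep the rows and retry later.
                return sent
            # Delete only after the batch was accepted.
            with self.db:
                self.db.execute(
                    "DELETE FROM pending WHERE id <= ?", (rows[-1][0],)
                )
            sent += len(rows)
```

Because a batch is deleted only after `send_fn` returns, a crash between sending and deleting will resend that batch, so delivery is at-least-once. Indexing with deterministic document IDs makes such resends harmless. The buffer is bounded only by local disk, so it has to be sized for the longest disconnection you want to survive.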