Search across multiple ES data sources

Christian_Dahlqvist · April 3, 2019, 4:43am

I cannot predict exactly what problems you will face, but I suspect they will arise where your theoretical model meets real-world problems and software limitations. One example is cross-cluster search (CCS), which was never designed to work at this scale. If you find issues around this in Elasticsearch, I suspect fixing them might not be considered high priority, assuming that the feature scales to the level that was intended and this is a very unusual edge case. You may therefore end up having to query all clusters in parallel from your application, and if you go down that route I suspect you could face issues with performance and network reliability, as it is a very large distributed system.

The algorithm you describe seems to be a standard scatter-gather, and there is nothing wrong with that. Getting something like this to work at scale across a large distributed system with good performance is, however, quite challenging in practice.
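For illustration, here is a minimal sketch of what application-side scatter-gather could look like. It is not how CCS works internally. The `SearchFn` callables are hypothetical stand-ins for whatever client call queries one cluster, and each is assumed to enforce its own timeout and to return `(score, doc_id)` pairs. Scores are assumed to be comparable across clusters, which is not guaranteed in practice.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import heapq
from typing import Callable, Dict, List, Tuple

# A hit is (score, document_id). Both names are illustrative assumptions.
Hit = Tuple[float, str]
# Hypothetical per-cluster search call: (query, k) -> that cluster's top-k hits.
SearchFn = Callable[[str, int], List[Hit]]


def scatter_gather(
    clusters: Dict[str, SearchFn], query: str, k: int
) -> Tuple[List[Hit], List[str]]:
    """Send the query to every cluster in parallel and merge the results.

    Returns the global top-k hits plus the names of clusters that failed,
    so the caller can show them as unavailable.
    """
    hits: List[Hit] = []
    unavailable: List[str] = []
    with ThreadPoolExecutor(max_workers=max(1, len(clusters))) as pool:
        # Scatter: one task per cluster.
        futures = {
            pool.submit(search, query, k): name
            for name, search in clusters.items()
        }
        # Gather: collect results as they arrive, tolerating failures.
        for future in as_completed(futures):
            name = futures[future]
            try:
                hits.extend(future.result())
            except Exception:
                unavailable.append(name)
    # Merge: each cluster returned at most k hits, so the global top-k
    # is contained in the union of the per-cluster results.
    return heapq.nlargest(k, hits), sorted(unavailable)
```

Even in this toy form, the slowest cluster determines the response time and every unreachable cluster has to be handled explicitly, which is where the difficulty at scale comes from.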

I would expect you to save yourself a lot of potential trouble by going with a more standard approach even if that requires a bit more hardware.

vasekz · April 3, 2019, 7:16am

Yes, I agree with you... this could be challenging. So, to summarize: I create an ES instance on each node for local querying and at the same time send all the data to the central cluster (or several clusters). The user then queries the central cluster(s) only. Sounds good.
Is there a standard way to sync local nodes with the central one, so that the data stays consistent in case of network disconnections? In my environment the disconnections can be long, even several days. During that time the GUI just shows that this particular node is not available. The problem is that when the node reconnects it should somehow sync the missing updates. I can buffer the updates on the node for a short time, say hours, to handle spontaneous disconnections, but not for days. From David's answer above I understand that defining a replica in the central location for each node is not a good solution. What would be the right approach then?

Christian_Dahlqvist · April 3, 2019, 8:56am

I would recommend writing to the local node and the central cluster in parallel, but if you have lengthy disconnections this means you need to enqueue the data locally until the connection is re-established. This is something that persistent queues and multiple pipelines can achieve, but there may also be other ways.
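As one of those other ways, here is a rough sketch of the enqueue-and-replay idea done in the application itself. It is only an illustration under assumptions: a SQLite file stands in for the durable queue, and `send` is a hypothetical callable that indexes one document into the central cluster and raises on failure. Disk space on the node still limits how long a disconnection it can absorb.

```python
import json
import sqlite3
from typing import Callable


class LocalBuffer:
    """Durable FIFO of pending updates, kept in a SQLite file on the node."""

    def __init__(self, path: str = "pending_updates.db") -> None:
        # The file name is an arbitrary example; ":memory:" works for tests.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, doc TEXT NOT NULL)"
        )
        self.db.commit()

    def enqueue(self, doc: dict) -> None:
        """Persist one update before any attempt to ship it."""
        self.db.execute(
            "INSERT INTO pending (doc) VALUES (?)", (json.dumps(doc),)
        )
        self.db.commit()

    def pending(self) -> int:
        """Number of updates not yet acknowledged by the central cluster."""
        return self.db.execute("SELECT COUNT(*) FROM pending").fetchone()[0]

    def flush(self, send: Callable[[dict], None]) -> int:
        """Replay buffered updates in order; stop at the first failure.

        Returns how many updates were sent and removed from the buffer.
        """
        sent = 0
        rows = self.db.execute(
            "SELECT seq, doc FROM pending ORDER BY seq"
        ).fetchall()
        for seq, doc in rows:
            try:
                send(json.loads(doc))
            except Exception:
                break  # still disconnected; keep the rest for the next try
            # Delete only after a successful send (at-least-once delivery).
            self.db.execute("DELETE FROM pending WHERE seq = ?", (seq,))
            self.db.commit()
            sent += 1
        return sent
```

Because an update is deleted only after `send` succeeds, a crash between the two steps can resend it, so the central write should be idempotent, for example by indexing with a deterministic document ID.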

vasekz · April 3, 2019, 9:46am

Thanks a lot! Appreciate your patience and detailed answers!

