Search across multiple ES data sources

DavidTurner · April 2, 2019, 10:34am

TIL there is a limit. It defaults to 4kB but can be adjusted by changing http.max_initial_line_length. If you are seeing a limit of ~2k then this is being imposed by something outside of Elasticsearch.

Also I agree with what @Christian_Dahlqvist said. Your proposed architecture is very strange.

vasekz · April 2, 2019, 10:41am

The requirements are more-or-less like this:

There are hundreds, possibly tens of thousands, of devices that report data, depending on the deployment. The devices are deployed on premises. I need to perform queries and analytics on this data at some central place and also locally, in the context of a standalone device. Each device produces up to 100K key-value pairs, where each value is itself a map of ~200 key-value pairs, and each of those values is a string of 100 bytes on average.
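For a sense of scale, here is a rough back-of-the-envelope calculation from the figures above. It is only a sketch: it counts raw value bytes, ignores keys and any index overhead, and the two fleet sizes are illustrative picks from the "hundreds to tens of thousands" range.

```python
# Rough sizing from the figures quoted in the post (upper bounds, values only).
KEYS_PER_DEVICE = 100_000   # "up to 100K key-value pairs"
FIELDS_PER_VALUE = 200      # "~200 key-value pairs" inside each value
BYTES_PER_FIELD = 100       # "100 bytes on average"
GB = 10**9


def raw_bytes_per_device() -> int:
    """Raw string payload per device, ignoring keys and index overhead."""
    return KEYS_PER_DEVICE * FIELDS_PER_VALUE * BYTES_PER_FIELD


def raw_bytes_total(device_count: int) -> int:
    """Raw payload for a whole fleet of identical devices."""
    return device_count * raw_bytes_per_device()


# Illustrative fleet sizes (assumptions, not figures from the post).
for devices in (100, 10_000):
    print(f"{devices} devices: about {raw_bytes_total(devices) // GB} GB raw")
```

That works out to about 2 GB of raw values per device at the upper bound, before any indexing overhead.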

vasekz · April 2, 2019, 10:47am

Please clarify why it's strange.
Regarding the 2K limit, it's a browser limit and I cannot ask the customer to change it. More precisely, I would prefer not to ask unless it's absolutely necessary.

vasekz · April 2, 2019, 11:01am

Well, I understand that the URL length is ultimately not limited.

"As Elasticsearch can be quite resource intensive, creating a few geographically distributed and highly available clusters that can be queried via CCS is the normal deployment pattern."
Does this mean that a large number of nodes for CCS is not acceptable?
How about my algorithm above, which works essentially the same as CCS (if it does) but without it?

Christian_Dahlqvist · April 2, 2019, 11:06am

Elasticsearch is designed to be a server component and requires a fair bit of resources. Are the devices able to support running Elasticsearch locally?

vasekz · April 2, 2019, 11:06am

yes.

Christian_Dahlqvist · April 2, 2019, 11:11am

This sounds like a very strange way to deploy Elasticsearch, and I have not seen anything that comes even remotely close. I would therefore recommend against deploying it this way, as you are likely to face issues and potential limitations that no one has come across before.

vasekz · April 2, 2019, 11:13am

I do not understand...
Is the number of nodes too large? What is strange?

Christian_Dahlqvist · April 2, 2019, 11:15am

Trying to connect a very large number of single node clusters is not a common deployment pattern.

vasekz · April 2, 2019, 11:17am

What would you suggest then? The nodes are not necessarily connected to each other via the network. They produce the data independently and I have to present an aggregated view. The connection between the nodes and the central point is guaranteed.

Christian_Dahlqvist · April 2, 2019, 11:21am

Without more details about the devices and their connectivity patterns, which seem to affect the design, I would recommend what has already been recommended: one or more centralized clusters that the devices feed into.

You might also have local nodes that are only ever used for local analysis and that also send the data to a centralized cluster. This duplicates the data but might give you a way to efficiently serve both modes. I have seen this kind of pattern work when you have isolated platforms with limitations on network connectivity and bandwidth, e.g. servers hosted on oil platforms.
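A minimal sketch of that dual-write pattern, assuming intermittent connectivity to the central cluster. The `DualWriter` class and the two callables are hypothetical stand-ins for whatever actually indexes into the local node and the central cluster. The in-memory queue is for illustration only, since it would not survive a restart.

```python
from collections import deque
from typing import Callable, Deque


class DualWriter:
    """Writes every document locally and forwards it centrally when possible."""

    def __init__(self, write_local: Callable[[dict], None],
                 write_central: Callable[[dict], None]) -> None:
        self._write_local = write_local      # hypothetical: index into the local node
        self._write_central = write_central  # hypothetical: index into the central cluster
        self._pending: Deque[dict] = deque()

    def write(self, doc: dict) -> None:
        # The local write always happens, so local analysis never depends on the link.
        self._write_local(doc)
        self._pending.append(doc)
        self.flush()

    def flush(self) -> int:
        """Forward queued documents in order; stop at the first connection failure."""
        sent = 0
        while self._pending:
            try:
                self._write_central(self._pending[0])
            except ConnectionError:
                break  # link is down: keep the remaining documents queued
            self._pending.popleft()
            sent += 1
        return sent
```

Calling `flush()` again once the link is back drains whatever accumulated while the device was offline.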

vasekz · April 2, 2019, 11:48am

I understand your point. This has been suggested above as well. The disadvantages I see are

A constant flow of data, even if the GUI is closed.
The central point is on the customer's site, not in the cloud. Therefore, I would have to request enough hardware to run a cluster capable of supporting all those millions of records, including constant updates. I would also need to set up, configure and maintain the cluster. I'm not sure it's a difficult task, but it's not an insignificant one.
Since I have to deploy ES on each node anyway, why not use my algorithm below?

1. From the "central" node, query the remote nodes.
2. Store the results received from each node on the central ES node.
3. Run the same query on the central ES node. In my opinion, this should give the aggregated result across all the nodes.
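The three steps above can be sketched roughly as follows. This is an in-memory model only: plain lists stand in for the ES nodes, a predicate stands in for the query, and the names `query_store` and `aggregate_via_central` are made up for illustration. A real version would issue HTTP search and index requests instead.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Doc = dict
Query = Callable[[Doc], bool]  # hypothetical stand-in for an Elasticsearch query


def query_store(store: Iterable[Doc], query: Query) -> List[Doc]:
    """Return every document in the store that matches the query."""
    return [doc for doc in store if query(doc)]


def aggregate_via_central(remote_stores: Dict[str, List[Doc]], query: Query) -> List[Doc]:
    # The central store is keyed by (node id, doc id) so that documents from
    # different nodes do not overwrite each other and re-runs do not duplicate.
    central: Dict[Tuple[str, str], Doc] = {}

    # Steps 1 and 2: query each remote node and store its results centrally.
    for node_id, store in remote_stores.items():
        for doc in query_store(store, query):
            central[(node_id, doc["id"])] = doc

    # Step 3: run the same query against the central store.
    return query_store(central.values(), query)


remotes = {
    "node-a": [{"id": "1", "status": "error"}, {"id": "2", "status": "ok"}],
    "node-b": [{"id": "1", "status": "error"}],
}
combined = aggregate_via_central(remotes, lambda d: d["status"] == "error")
print(len(combined))
```

One thing the sketch makes visible: step 3 only sees what step 1 returned, so each remote query would have to return all matching documents rather than a top-N page for the central result to be complete.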

Christian_Dahlqvist · April 2, 2019, 11:52am

I suspect CCS will not scale to that number of clusters and even if it did I would expect you to experience very poor query performance as the slowest remote cluster potentially would dictate latency. If any cluster was temporarily unavailable you could also see errors and/or partial results.
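To illustrate the latency and partial-result concern, here is a small sketch of fanning a query out to many nodes with a deadline. It is not how CCS works internally. The `fan_out` function and the fetcher callables are hypothetical, and the fetchers simulate a fast node, a slow node and an unreachable node.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, List, Tuple


def fan_out(fetchers: Dict[str, Callable[[], list]], timeout_s: float,
            max_workers: int = 32) -> Tuple[Dict[str, list], List[str]]:
    """Query all nodes in parallel; return (results per node, nodes that failed or timed out)."""
    pool = ThreadPoolExecutor(max_workers=min(max_workers, max(1, len(fetchers))))
    futures = {pool.submit(fn): node_id for node_id, fn in fetchers.items()}
    done, not_done = wait(futures, timeout=timeout_s)

    results: Dict[str, list] = {}
    failed: List[str] = []
    for fut in done:
        node_id = futures[fut]
        try:
            results[node_id] = fut.result()
        except Exception:
            failed.append(node_id)       # node answered with an error
    for fut in not_done:
        fut.cancel()                     # only prevents queued work from starting
        failed.append(futures[fut])      # node missed the deadline
    pool.shutdown(wait=False)
    return results, sorted(failed)


def _slow() -> list:
    time.sleep(1.0)                      # simulates an overloaded device
    return ["late hit"]


def _down() -> list:
    raise ConnectionError("node unreachable")


results, failed = fan_out({"fast": lambda: ["hit"], "slow": _slow, "down": _down},
                          timeout_s=0.2)
print(results, failed)
```

Without the deadline the caller waits for the slowest node. With it, the caller has to decide what to do with an incomplete answer.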

vasekz · April 2, 2019, 12:12pm

I totally agree with you. That's why I suggest the algorithm above. Is it a good approach?

Christian_Dahlqvist · April 2, 2019, 12:43pm

How are you going to query the data? Are you going to use Kibana? As far as I can tell, that approach would still suffer from the same drawbacks but would also require a custom UI, as Kibana does not support anything like that.

vasekz · April 2, 2019, 12:45pm

I'm going to run HTTP requests via my Java-based server and then present the aggregated results in the GUI. No plans for Kibana right now.

Christian_Dahlqvist · April 2, 2019, 12:50pm

I still think you are going to have a lot of problems with such an approach, and I stand by my recommendation. Good luck!

vasekz · April 2, 2019, 12:51pm

But could you clarify what kind of problems?

Christian_Dahlqvist · April 2, 2019, 1:47pm

Given that it is an unusual approach I would not be surprised if you ran into the type of problems I described earlier plus some novel problems, limitations or edge cases that no one might have encountered before.

vasekz · April 2, 2019, 1:48pm

I still cannot figure out what the problem with my algorithm could be. Can you be more specific?
