TIL there is a limit. It defaults to 4kB but can be adjusted by changing http.max_initial_line_length. If you are seeing a limit of ~2k then this is being imposed by something outside of Elasticsearch.
Also I agree with what @Christian_Dahlqvist said. Your proposed architecture is very strange.
There are hundreds, possibly tens of thousands, of devices that report data, depending on the deployment. The devices are deployed on premise. I need to perform queries and analytics on this data at some central place and also locally in the standalone device context. Each device produces up to 100K key-value pairs, where each value is itself a map of ~200 key-value pairs, and each of those values is a string of 100 bytes on average.
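To put those numbers in perspective, here is a rough back-of-the-envelope estimate. It counts only the raw string values and ignores keys, JSON and index overhead, and the fleet sizes of 100 and 10,000 devices are just illustrative picks from the "hundreds, possibly tens of thousands" range.

```python
# Upper-bound estimate of raw value bytes, using the figures quoted above.
KEYS_PER_DEVICE = 100_000      # "up to 100K key-value pairs"
FIELDS_PER_VALUE = 200         # "~200 key-value pairs" per value
BYTES_PER_FIELD = 100          # "100 bytes on average"

per_device_bytes = KEYS_PER_DEVICE * FIELDS_PER_VALUE * BYTES_PER_FIELD


def fleet_bytes(device_count: int) -> int:
    """Raw value bytes for a fleet of the given size (illustrative only)."""
    return device_count * per_device_bytes


GB = 10**9
for devices in (100, 10_000):  # assumed fleet sizes, not from the deployment
    print(f"{devices} devices: ~{fleet_bytes(devices) / GB:,.0f} GB of raw values")
```

That is about 2 GB of raw values per device at the upper bound, before any indexing overhead.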
Please clarify why it's strange.
Regarding the 2K limit, it's a browser limit and I cannot ask the customer to change it. More precisely, I would prefer not to ask unless it's absolutely necessary.
Well, I understand that URL is eventually not limited.
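For what it's worth, the search API also accepts the query as a JSON request body, so a long query does not have to travel in the URL at all. A minimal sketch, assuming a hypothetical host, index name (device-data) and field (device_id); it only builds the request and does not send it:

```python
import json
import urllib.request


def build_search_request(base_url: str, index: str, query: dict) -> urllib.request.Request:
    """Build a POST to the _search endpoint with the query in the body, not the URL."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical query that would be far too long for a 2K URL.
large_query = {
    "query": {"terms": {"device_id": [f"device-{i}" for i in range(1000)]}}
}
req = build_search_request("http://localhost:9200", "device-data", large_query)

print(len(req.full_url), "characters in the URL")
print(len(req.data), "bytes in the body")
```

The URL stays short no matter how large the query grows; only the body size changes.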
As Elasticsearch can be quite resource intensive, creating a few geographically distributed and highly available clusters that can be queried across using CCS is the normal deployment pattern.
Does this mean that a large number of nodes for CCS is not acceptable?
How about my algorithm above, which works essentially the same way as CCS (if it does) but without it?
Elasticsearch is designed to be a server component and requires a fair bit of resources. Are the devices able to support running Elasticsearch locally?
Sounds like a very strange way to deploy Elasticsearch, and I have not seen anything that comes even remotely close. I would therefore recommend against deploying it this way, as you are likely to face issues and potential limitations that no one has come across before.
What would you suggest then? Nodes are not necessarily connected to each other via the network. They produce the data independently and I have to present an aggregated view. The connection between the nodes and the central point is guaranteed.
Without more details about the devices and their connectivity patterns, which seem to affect the design, I would recommend what has already been recommended: one or more centralized clusters that the devices feed into.
You might also have local nodes that are only ever used for local analysis and that also send the data to a centralized cluster. This duplicates the data but might give you a way to efficiently serve both modes. I have seen this kind of pattern work when you have isolated platforms with limitations on network connectivity and bandwidth, e.g. servers hosted on oil platforms.
I understand your point. This has been suggested above as well. The disadvantages I see are:
A constant flow of data even if the GUI is closed.
The central point is on the customer's site, not in the cloud. Therefore, I would have to request enough hardware to run a cluster capable of supporting all those millions of records, including constant updates. Also, I need to establish, configure and maintain the cluster. I'm not sure it's a difficult task, but it's not an insignificant one.
Since I have to deploy ES on each node anyway, why not use my algorithm below?
1. From the "central" node, query the remote nodes.
2. Store the result received from each node on the central ES node.
3. Run the same query on the central ES node. In my opinion it should give the aggregated result across all the nodes.
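The steps above can be sketched roughly as follows. This is only an illustration of the control flow, not an Elasticsearch client: the three callables (search_remote, store_central, search_central) are hypothetical placeholders, and the demo replaces them with in-memory stand-ins where a "query" is just a dict of exact field matches.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Doc = Dict[str, object]
Query = Dict[str, object]


def fan_out_and_collect(
    nodes: Iterable[str],
    query: Query,
    search_remote: Callable[[str, Query], List[Doc]],
    store_central: Callable[[str, List[Doc]], None],
    search_central: Callable[[Query], List[Doc]],
) -> Tuple[List[Doc], List[str]]:
    """Query each node, store its hits centrally, then re-run the query centrally."""
    failed: List[str] = []
    for node in nodes:
        try:
            hits = search_remote(node, query)   # step 1
        except ConnectionError:
            failed.append(node)                 # result will be partial
            continue
        store_central(node, hits)               # step 2
    return search_central(query), failed        # step 3


# --- In-memory stand-ins for the demo (all names and data are made up) ---
remote_data = {
    "node-a": [{"id": "a1", "status": "error"}, {"id": "a2", "status": "ok"}],
    "node-b": [{"id": "b1", "status": "error"}],
    "node-c": None,  # simulates an unreachable node
}
central_store: Dict[Tuple[str, str], Doc] = {}


def matches(doc: Doc, query: Query) -> bool:
    return all(doc.get(field) == value for field, value in query.items())


def search_remote(node: str, query: Query) -> List[Doc]:
    docs = remote_data[node]
    if docs is None:
        raise ConnectionError(f"{node} is unreachable")
    return [d for d in docs if matches(d, query)]


def store_central(node: str, hits: List[Doc]) -> None:
    for d in hits:
        central_store[(node, str(d["id"]))] = d  # keyed so re-runs overwrite


def search_central(query: Query) -> List[Doc]:
    return [d for d in central_store.values() if matches(d, query)]


result, failed = fan_out_and_collect(
    ["node-a", "node-b", "node-c"], {"status": "error"},
    search_remote, store_central, search_central,
)
print(sorted(d["id"] for d in result), failed)
```

Even in this toy version the caller has to handle unreachable nodes and decide what a partial result means, and the loop waits on every node in turn.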
I suspect CCS will not scale to that number of clusters, and even if it did, I would expect you to experience very poor query performance, as the slowest remote cluster would potentially dictate latency. If any cluster was temporarily unavailable you could also see errors and/or partial results.
How are you going to query the data? Are you going to use Kibana? As far as I can tell, that approach would still suffer from the same drawbacks but would also require a custom UI, as Kibana does not support anything like that.
Given that it is an unusual approach I would not be surprised if you ran into the type of problems I described earlier plus some novel problems, limitations or edge cases that no one might have encountered before.