Search across multiple ES data sources

(vasekz) #5

If Elasticsearch is not a must, what would be the best solution? Assume that I have a hundred servers around the globe that independently produce key-value data. I then want to run a global query for the first N results across the servers. I thought about constantly uploading the data to the cloud, but then the system keeps uploading all the data even when the GUI is closed (99% of the time).

#6

This is far from anything I have done and I am also not a developer (this sounds like a developer problem) so keep that in mind :slight_smile:

If Elasticsearch were used, a central Elasticsearch cluster, possibly with a dedicated index for each location, could work. With a good naming convention for the indices you could target all of them, or a selected group, just by using an index pattern.

Additions and updates could be made directly to Elasticsearch.
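
To illustrate the naming-convention idea, here is a minimal sketch. The index names and the `<dataset>-<location>` convention are hypothetical, and Python's `fnmatchcase` only approximates how Elasticsearch expands wildcards, so treat it as an illustration rather than the server's actual behaviour.

```python
from fnmatch import fnmatchcase

# Hypothetical index names following a "<dataset>-<location>" convention.
INDICES = ["phonebook-ny", "phonebook-la", "phonebook-london", "logs-ny"]


def select_indices(pattern, indices=INDICES):
    """Return the indices that a wildcard pattern such as 'phonebook-*' would target.

    fnmatchcase is a stand-in for Elasticsearch's own wildcard expansion.
    """
    return [name for name in indices if fnmatchcase(name, pattern)]


def search_path(pattern):
    """Build the path of a search request that targets the given index pattern."""
    return "/" + pattern + "/_search"
```

With these names, `phonebook-*` selects every location's phonebook, while `*-ny` selects everything stored for one location.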

(vasekz) #7

Thanks for your prompt answers. Personally, I thought it was not a dev problem since there is no code to write, just configuration. At least I'm trying to avoid coding.

(David Turner) #8

I don't know why you think a cross-cluster search (CCS) setup would be messy. It sounds reasonable to me. The performance will be limited by the latency and bandwidth between the clusters.

If performance is critical then the other alternative is cross-cluster replication (CCR) which allows you to replicate each index to the other clusters so that, for instance, the NY server has the authoritative copy of the NY phonebook but also has a local copy of the LA phonebook that tracks the authoritative copy that lives on the LA site.

In either case, if you are doing this across the internet you will want to protect your communications using TLS. Here are the docs about TLS and CCS, but I think they mostly apply to CCR too: https://www.elastic.co/guide/en/elastic-stack-overview/6.6/cross-cluster-configuring.html
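
For orientation, registering remote clusters for CCS comes down to a cluster settings update. The sketch below only builds the request body for `PUT /_cluster/settings`; it assumes the 6.x setting name `cluster.remote.<alias>.seeds`, and the aliases and hostnames are hypothetical. TLS configuration is not shown.

```python
import json


def remote_cluster_settings(seeds_by_alias):
    """Build a body for PUT /_cluster/settings that registers remote clusters.

    seeds_by_alias maps a cluster alias to a list of "host:transport_port" seeds.
    """
    remote = {alias: {"seeds": list(seeds)} for alias, seeds in seeds_by_alias.items()}
    return {"persistent": {"cluster": {"remote": remote}}}


body = remote_cluster_settings({
    "ny": ["ny.example.com:9300"],  # hypothetical host, default transport port
    "la": ["la.example.com:9300"],  # hypothetical host, default transport port
})
print(json.dumps(body, indent=2))
```

Generating the body from a single mapping like this is one way to keep the configuration manageable when the list of sites grows.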

#9

Messy might be the wrong word, you are right. If the CCS is needed only in one place it is definitely manageable. If you need to manage the config in all 100+ independent ES servers, it is possibly a bit more challenging.

(vasekz) #10

Currently I am considering three options:

  1. Send everything to a central place (cloud) and query from there. Pros: fast results. Cons: every change is sent to the cloud even when no GUI is open. Since there are a lot of changes, there is constantly heavy traffic.
  2. Keep the data on local ES instances and configure cross-cluster search on the central ES node. Pros: no network traffic when no GUI is open. Cons: queries take more time, as ES has to fetch the data from many remote nodes.
  3. Keep the data on local ES instances and configure each instance to have 1 replica on a central server (cloud), then somehow configure the central node to query the replicas. I do not know if this is even possible.

Could you suggest how to proceed?

(David Turner) #11

The choice between options 1 and 2 is purely down to the economics of your use-case. Option 1 (cross-cluster replication) involves more traffic but should deliver results quicker than option 2 (cross-cluster search). You'll probably have to do some benchmarking with a realistic workload to determine exactly what the costs and benefits are.

Option 3 is very much not recommended. Elasticsearch expects connections between nodes in a single cluster to be fairly reliable and to have fairly low latency, as if they were in a single datacenter, and this wouldn't be the case for that setup.

(vasekz) #12

I managed to configure cross-cluster search successfully. Is there a convenient way to sort the results? Currently I get separate (sorted) results for each node. Am I supposed to merge the results manually? An example of a query is "get the first N people, sorted by name, whose phone numbers end in 00".

(David Turner) #13

If you search multiple indices then you should be getting one set of results, whether the indices you've searched are local or remote. Can you share the actual query you're sending, and an example of the results? I don't really know what you mean by "separated" results.

(vasekz) #14

The setup:
One node, called 41 (its IP ends in 41), has fieldvasek : valuevasek41
Another node, called 12 has fieldvasek : valuevasek12
The two nodes are connected via the remote cluster configuration. Cluster names are "cluster41" and "cluster12".

I want to write a query that returns sorted results from both nodes (maybe my query is not correct, I admit).

The query:
curl -X GET "localhost:9200/cluster41:indexname,cluster12:indexname/_search" -H 'Content-Type: application/json' -d '{"query":{"wildcard":{"fieldvasek":"value*"}},"sort":{"fieldvasek":{"order":"asc"}}}'

The answer:
{"took":10,"timed_out":false,"_shards":{"total":10,"successful":10,"skipped":0,"failed":0},"_clusters":{"total":2,"successful":2,"skipped":0},"hits":{"total":2,"max_score":null,"hits":[{"_index":"cluster12:indexname","_type":"typename","_id":"optionalUniqueId","_score":null,"_source":{ "fieldvasek" : "valuevasek12"},"sort":["valuevasek12"]},{"_index":"cluster41:indexname","_type":"typename","_id":"optionalUniqueId","_score":null,"_source":{ "fieldvasek" : "valuevasek41"},"sort":["valuevasek41"]}]}}

(David Turner) #15

Reformatting:

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "skipped": 0,
    "failed": 0
  },
  "_clusters": {
    "total": 2,
    "successful": 2,
    "skipped": 0
  },
  "hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "cluster12:indexname",
        "_type": "typename",
        "_id": "optionalUniqueId",
        "_score": null,
        "_source": {
          "fieldvasek": "valuevasek12"
        },
        "sort": [
          "valuevasek12"
        ]
      },
      {
        "_index": "cluster41:indexname",
        "_type": "typename",
        "_id": "optionalUniqueId",
        "_score": null,
        "_source": {
          "fieldvasek": "valuevasek41"
        },
        "sort": [
          "valuevasek41"
        ]
      }
    ]
  }
}

I still don't see what you mean by "separated". There's just one list of hits.
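
A quick way to check this without reading the JSON by eye is sketched below. It uses the response from this thread, trimmed to the relevant fields; the helper names are made up for illustration.

```python
import json

# The response posted above, trimmed to the fields that matter for ordering.
RESPONSE = json.loads("""
{
  "hits": {
    "total": 2,
    "hits": [
      {"_index": "cluster12:indexname", "sort": ["valuevasek12"]},
      {"_index": "cluster41:indexname", "sort": ["valuevasek41"]}
    ]
  }
}
""")


def sort_keys(response):
    """Return the sort values of every hit, in the order they were returned."""
    return [hit["sort"] for hit in response["hits"]["hits"]]


def is_globally_sorted(response):
    """True if the hits form one ascending list regardless of source cluster."""
    keys = sort_keys(response)
    return all(a <= b for a, b in zip(keys, keys[1:]))


def source_clusters(response):
    """Return the cluster alias each hit came from (the prefix before ':')."""
    return [hit["_index"].split(":", 1)[0] for hit in response["hits"]["hits"]]
```

Here the hits come from two different clusters but share a single ascending `sort` order, which is what a merged result looks like.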

(vasekz) #16

You are right doc :slight_smile: I misinterpreted the output... still learning

(vasekz) #17

Given that I have tens of thousands of remote nodes, would the following be the right approach?

The task is to get the first N sorted results. Due to the large number of remote nodes it is not possible to use cross-cluster search (is that correct?).

  1. Query the remote nodes from a "central" node.
  2. Store the result received from each node on the central ES node.
  3. Run the query on the central ES node. In my opinion this should give the aggregated result across all the nodes.
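
The merge at the heart of these steps can be sketched as follows. This is a simplification under stated assumptions: it merges in memory with `heapq.merge` instead of indexing the intermediate results into a central ES node, the per-node lists are hypothetical, and each node is assumed to return its own results already sorted.

```python
import heapq
from itertools import islice


def first_n_sorted(per_node_results, n, key=None):
    """Merge per-node result lists (each already sorted) and keep the first n.

    Each node only has to return its own first n results: the global first n
    is always contained in the union of the per-node first n.
    """
    return list(islice(heapq.merge(*per_node_results, key=key), n))


# Hypothetical per-node responses, each sorted by the node itself.
node_12 = ["valuevasek12", "valuevasek77"]
node_41 = ["valuevasek41", "valuevasek90"]
node_63 = ["valuevasek05"]

top3 = first_n_sorted([node_12, node_41, node_63], 3)
```

The central side still has to contact every node and receive up to N results from each, so the network cost grows with the number of nodes.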

(David Turner) #18

I don't think there's any limit on the number of remote clusters involved in a cross-cluster search, but it is true that CCS isn't really designed or tested with this many remote clusters. You might encounter difficulties.

It seems unlikely you need to run tens of thousands of separate Elasticsearch clusters. It's more common to use something lightweight like Beats to feed data from many sources into a smaller number of Elasticsearch clusters.

(vasekz) #19

>>> I don't think there's any limit
Since the remote ES nodes are listed in the URL, their number is limited to a few hundred (I did not check exactly how many), as the URL is limited to ~2K characters. Is there any way to list the nodes in the request body?

>>>It's more common to use something lightweight like Beats
Thanks for pointing out Beats. I was not familiar with this technology and am checking it now. At first glance it is not suitable, since I need to be able to connect to each node locally and get results in the context of that standalone node. Therefore, it seems that I have to have ES on each node.

(Christian Dahlqvist) #20

I do not understand the reasoning behind this at all. Could you please explain the requirements driving this?

As Elasticsearch can be quite resource intensive, creating a few geographically distributed and highly available clusters that can be queried via CCS is the normal deployment pattern. Why would this not work for you?

(David Turner) #21

TIL there is a limit. It defaults to 4kB but can be adjusted by changing http.max_initial_line_length. If you are seeing a limit of ~2k then this is being imposed by something outside of Elasticsearch.
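
As a rough sketch of what that limit means in practice, the following estimates how many `alias:index` targets fit in the request line. The alias scheme `cluster<i>` and the index name are taken from the example in this thread; the exact accounting of the request line is an approximation.

```python
DEFAULT_MAX_INITIAL_LINE = 4 * 1024  # bytes, the default quoted above


def request_line(cluster_aliases, index="indexname"):
    """Build the HTTP request line of a CCS search over the given aliases."""
    targets = ",".join(f"{alias}:{index}" for alias in cluster_aliases)
    return f"GET /{targets}/_search HTTP/1.1"


def max_clusters(limit, index="indexname"):
    """Largest number of aliases 'cluster<i>' whose request line fits in limit bytes."""
    count = 0
    while True:
        aliases = [f"cluster{i}" for i in range(count + 1)]
        if len(request_line(aliases, index).encode()) > limit:
            return count
        count += 1
```

Calling `max_clusters(DEFAULT_MAX_INITIAL_LINE)` or `max_clusters(2048)` gives a feel for the ceiling under each limit; longer aliases or index names lower it further.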

Also I agree with what @Christian_Dahlqvist said. Your proposed architecture is very strange.

(vasekz) #22

The requirements are more-or-less like this:

There are hundreds, possibly tens of thousands (depending on the deployment), of devices that report data. The devices are deployed on premise. I need to perform queries and analytics on this data at some central place and also locally, in the context of a standalone device. Each device produces up to 100K key-value pairs, where each value is itself a map of ~200 key-value pairs, and each of those values is a string of 100 bytes on average.
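
For scale, these figures work out as in the back-of-the-envelope sketch below. It counts only the raw string values (no key names, no index overhead), and the fleet size of 10,000 devices is a hypothetical value inside the stated range.

```python
KEYS_PER_DEVICE = 100_000  # "up to 100K key-value pairs"
FIELDS_PER_VALUE = 200     # "~200 key-value pairs" inside each value
BYTES_PER_FIELD = 100      # "100 bytes on average"


def raw_bytes(devices):
    """Upper-bound raw payload in bytes for the given number of devices."""
    return devices * KEYS_PER_DEVICE * FIELDS_PER_VALUE * BYTES_PER_FIELD


per_device_gb = raw_bytes(1) / 1e9   # 2.0 GB per device
fleet_tb = raw_bytes(10_000) / 1e12  # 20.0 TB for a hypothetical 10,000 devices
```

So each device holds up to roughly 2 GB of raw values, which is the amount that would have to be shipped or replicated per device in the centralised options.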

(vasekz) #23

Please clarify why it's strange.
Regarding the 2K limit, it is a browser limit and I cannot ask the customer to change it. More precisely, I would prefer not to ask unless it is absolutely necessary.

(vasekz) #24

Well, I understand that the URL length is ultimately not the limiting factor.

>>> As Elasticsearch can be quite resource intensive, creating a few geographically distributed and highly available clusters that can be queried via CCS is the normal deployment pattern
Does this mean that a large number of nodes for CCS is not acceptable?
How about my algorithm above, which works essentially the same way as CCS (if it does :slight_smile: ) but without it?