Search across multiple ES data sources

Hi,

I have a hundred independent ES servers with data of the same structure. For example, the NY server keeps the NY phone book, the LA server keeps the LA phone book, etc. Is there a way to query this data from a central point? For example, I want to get phone numbers ending in 00 across all cities.

Thanks.

Hi @vasekz,

which version of Elasticsearch are you running? Later versions of Elasticsearch support Cross Cluster Search.

I would not recommend doing that over the internet but maybe you have some sort of network tunnels between your locations :slight_smile:
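
For reference, registering the remote clusters on the central node would look roughly like this. This is only a sketch assuming a recent version where the cluster.remote settings exist; the aliases and hosts are placeholders, and note that the seeds point at the transport port (9300 by default), not the HTTP port:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_ny": { "seeds": ["ny.example.com:9300"] },
        "cluster_la": { "seeds": ["la.example.com:9300"] }
      }
    }
  }
}'

Once the remotes are registered, you can address their indices as cluster_ny:indexname,cluster_la:indexname in an ordinary search request.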

The project is still in the planning stage, so I can take the latest version. What issues can arise if I do that over the internet, and are they manageable?

I was thinking mainly about the security implications of exposing Elasticsearch to the internet... Performance will probably take a serious hit as well.

That is a fair number. Although Cross Cluster Search is technically possible, the config would be pretty messy.

I hope someone else has better suggestions :slight_smile:

If Elasticsearch is not a must, what would be the best solution? Assume that I have a hundred servers around the globe that independently produce key-value data. Then I want to perform a global query for the first N results across the servers. I thought about constantly uploading the data to the cloud, but in that case the system keeps uploading all the data even when the GUI is closed (99% of the time).

This is far from anything I have done and I am also not a developer (this sounds like a developer problem) so keep that in mind :slight_smile:

If Elasticsearch were used, a central Elasticsearch cluster with dedicated indices for each location could work. With a good naming convention for the indices, you could target all of them or a selected group just by using an index pattern.

Additions/updates could be made directly to Elasticsearch.
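
For example, with indices named like phonebook-ny and phonebook-la, a single request against an index pattern could cover all of them. The index and field names here are only an illustration:

curl -X GET "localhost:9200/phonebook-*/_search" -H 'Content-Type: application/json' -d '
{
  "query": { "wildcard": { "phone": "*00" } }
}'

A comma-separated list such as phonebook-ny,phonebook-la would target just a subset of the locations in the same way.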

Thanks for your prompt answers. Personally, I thought it wasn't a dev problem since there is no code to write, just configuration. At least I'm trying to avoid coding.

I don't know why you think a cross-cluster search (CCS) setup would be messy. It sounds reasonable to me. The performance will be limited by the latency and bandwidth between the clusters.

If performance is critical then the other alternative is cross-cluster replication (CCR) which allows you to replicate each index to the other clusters so that, for instance, the NY server has the authoritative copy of the NY phonebook but also has a local copy of the LA phonebook that tracks the authoritative copy that lives on the LA site.

In either case, if you are doing this across the internet you will want to protect your communications using TLS. Here are the docs about TLS and CCS, but I think they mostly also apply to CCR too: Cross cluster search and security | Elasticsearch Guide [6.6] | Elastic
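
To give a flavour of the CCR option, on the NY cluster you would create a follower for the LA phonebook roughly like this. CCR needs Elasticsearch 6.7 or later and a licence level that includes it, and the cluster alias and index names below are only placeholders:

curl -X PUT "localhost:9200/la-phonebook-copy/_ccr/follow" -H 'Content-Type: application/json' -d '
{
  "remote_cluster": "cluster_la",
  "leader_index": "la-phonebook"
}'

The follower index la-phonebook-copy then tracks changes to la-phonebook on the LA cluster, and searches on the NY cluster can include it like any other local index.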

Messy might be the wrong word, you are right. If CCS is needed only in one place, it is definitely manageable. If you need to manage the config on all 100+ independent ES servers, it is possibly a bit more challenging.


Currently I am considering three options:

  1. Send everything to some central place (the cloud) and then query from the cloud. Pros: fast results. Cons: every change is sent to the cloud even if no GUI is open, and since there are a lot of changes there is constant heavy traffic.
  2. Keep the data on local ES instances and configure Cross Cluster Search on the central ES node. Pros: no network traffic if no GUI is open. Cons: queries take more time as ES has to fetch the data from many remote nodes.
  3. Keep the data on local ES instances and configure each ES instance so that it has one replica on a central server (the cloud). Somehow configure the central node so that it queries the replicas. I do not know if this is even possible.

Could you suggest how to proceed?

The choice between options 1 and 2 is purely down to the economics of your use case. Option 1 (which you could implement with cross-cluster replication) involves more traffic but should deliver results quicker than option 2 (cross-cluster search). You'll probably have to do some benchmarking with a realistic workload to determine exactly what the costs and benefits are.

Option 3 is very much not recommended. Elasticsearch expects connections between nodes in a single cluster to be fairly reliable and to have fairly low latency, as if they were in a single datacenter, and this wouldn't be the case for that setup.

I managed to configure cross-cluster search successfully. Is there a convenient way to sort the results? Currently I get separate (sorted) results for each node. Am I supposed to merge the results manually? An example of a query is "get the first N people sorted by their names with phone numbers ending in 00".

If you search multiple indices then you should be getting one set of results, whether the indices you've searched are local or remote. Can you share the actual query you're sending, and an example of the results? I don't really know what you mean by "separated" results.

The setup:
One node, called 41 (its IP ends in 41), has fieldvasek : valuevasek41
Another node, called 12, has fieldvasek : valuevasek12
The two nodes are connected via the remote cluster configuration. Cluster names are "cluster41" and "cluster12".

I want to write a query that returns sorted results from both nodes (maybe my query is not correct, I admit).

The query:
curl -X GET "localhost:9200/cluster41:indexname,cluster12:indexname/_search" -d "{"query":{"wildcard":{"fieldvasek":"value*"}},"sort":{"fieldvasek":{"order":"asc"}}}" -H 'Content-Type: application/json'

The answer (reformatted):

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "skipped": 0,
    "failed": 0
  },
  "_clusters": {
    "total": 2,
    "successful": 2,
    "skipped": 0
  },
  "hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "cluster12:indexname",
        "_type": "typename",
        "_id": "optionalUniqueId",
        "_score": null,
        "_source": {
          "fieldvasek": "valuevasek12"
        },
        "sort": [
          "valuevasek12"
        ]
      },
      {
        "_index": "cluster41:indexname",
        "_type": "typename",
        "_id": "optionalUniqueId",
        "_score": null,
        "_source": {
          "fieldvasek": "valuevasek41"
        },
        "sort": [
          "valuevasek41"
        ]
      }
    ]
  }
}

I still don't see what you mean by "separated". There's just one list of hits.
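
To get just the first N hits sorted across all the clusters, add a size to the same kind of request. For your phonebook example ("first N people sorted by name with phone numbers ending in 00") it might look like the following, assuming name and phone are keyword fields; those field names are only an illustration:

curl -X GET "localhost:9200/cluster41:indexname,cluster12:indexname/_search" -H 'Content-Type: application/json' -d '
{
  "size": 10,
  "query": { "wildcard": { "phone": "*00" } },
  "sort": { "name": { "order": "asc" } }
}'

Elasticsearch merges and sorts the hits from every cluster before returning the single top-10 list.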

You are right doc :slight_smile: I misinterpreted the output... still learning


Given that I have tens of thousands of remote nodes, would the following be the right approach?

The task is to get the first N sorted results. Due to the large number of remote nodes it's not possible to use cross-cluster search (is that correct?)

  1. From a "central" node query remote nodes.
  2. The result received from each node store on the central ES node
  3. Run the query on the central ES node. In my opinion it should give the aggregated result across all the nodes.
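
Roughly, I imagine something like this, where the host names, the merged-results index name, and the use of jq to turn each response into a bulk request are only placeholders for the sake of the sketch:

# fetch the local top results from each remote node and index them into the central node
for host in es-node-001.example.com es-node-002.example.com; do
  curl -s "http://$host:9200/indexname/_search" \
       -H 'Content-Type: application/json' \
       -d '{"size":10,"query":{"wildcard":{"fieldvasek":"value*"}},"sort":{"fieldvasek":{"order":"asc"}}}' \
  | jq -c '.hits.hits[]._source | {index:{}}, .' \
  | curl -s "http://central-node:9200/merged-results/_bulk" \
         -H 'Content-Type: application/x-ndjson' --data-binary @-
done

# then run the final sorted query on the central node
curl "http://central-node:9200/merged-results/_search" -H 'Content-Type: application/json' \
     -d '{"size":10,"sort":{"fieldvasek":{"order":"asc"}}}'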

I don't think there's any limit on the number of remote clusters involved in a cross-cluster search, but it is true that CCS isn't really designed or tested with this many remote clusters. You might encounter difficulties.

It seems unlikely you need to run tens of thousands of separate Elasticsearch clusters. It's more common to use something lightweight like Beats to feed data from many sources into a smaller number of Elasticsearch clusters.

>>> I don't think there's any limit
Since the remote ES nodes are listed in the URL, their number is limited to a few hundred (I did not check exactly how many) because the URL is limited to ~2K characters. Is there any way to list the nodes in the request body?

>>> It's more common to use something lightweight like Beats
Thanks for pointing out Beats. I was not familiar with this technology and am checking it now. At first glance it's not suitable, since I need to be able to connect to each node locally and get results in that standalone node's context. Therefore, it seems that I have to have ES on each node.

I do not understand the reasoning behind this at all. Could you please explain the requirements driving this?

As Elasticsearch can be quite resource intensive, creating a few geographically distributed and highly available clusters that can be queried using CCS is the normal deployment pattern. Why would this not work for you?