Search across multiple ES data sources

Hi,

I have a hundred independent ES servers with data of the same structure. For example, the NY server keeps the NY phone book, the LA server keeps the LA phone book, etc. Is there a way to query this data from a central point? For example, I want to get phone numbers ending in 00 across all cities.

Thanks.

Hi @vasekz,

which version of Elasticsearch are you running? Later versions of Elasticsearch support Cross Cluster Search.

I would not recommend doing that over the internet but maybe you have some sort of network tunnels between your locations :slight_smile:
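
For reference, registering the remote clusters on the central node would look roughly like this. This is only a sketch assuming a recent version where the cluster.remote settings exist; the aliases and hosts are placeholders, and note that the seeds point at the transport port (9300 by default), not the HTTP port:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_ny": { "seeds": ["ny.example.com:9300"] },
        "cluster_la": { "seeds": ["la.example.com:9300"] }
      }
    }
  }
}'

Once the remotes are registered, you can address their indices as cluster_ny:indexname,cluster_la:indexname in an ordinary search request.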

The project is still in the planning stage, so I can take the latest version. What issues can arise if I do that over the internet, and are they manageable?

I was thinking mainly about the security implications of exposing Elasticsearch to the internet... Performance will probably take a serious hit as well.

That is a fair number. Although Cross Cluster Search is technically possible, the config would be pretty messy.

I hope someone else has better suggestions :slight_smile:

If Elasticsearch is not a must, what would be the best solution? Assume that I have a hundred servers around the globe that independently produce key-value data. Then I want to perform a global query for the first N results across the servers. I thought about constantly uploading the data to the cloud, but in that case the system keeps uploading all the data even when the GUI is closed (99% of the time).

This is far from anything I have done and I am also not a developer (this sounds like a developer problem) so keep that in mind :slight_smile:

If Elasticsearch were used, a central Elasticsearch cluster with dedicated indices for each location could work. With a good naming convention for the indices, you could target all of them or a selected group just by using an index pattern.

Additions/updates could be made directly to Elasticsearch.
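
For example, with indices named like phonebook-ny and phonebook-la, a single request against an index pattern could cover all of them. The index and field names here are only an illustration:

curl -X GET "localhost:9200/phonebook-*/_search" -H 'Content-Type: application/json' -d '
{
  "query": { "wildcard": { "phone": "*00" } }
}'

A comma-separated list such as phonebook-ny,phonebook-la would target just a subset of the locations in the same way.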

Thanks for your prompt answers. Personally, I thought it wasn't a dev problem since there is no code to write, just configuration. At least I'm trying to avoid coding.

I don't know why you think a cross-cluster search (CCS) setup would be messy. It sounds reasonable to me. The performance will be limited by the latency and bandwidth between the clusters.

If performance is critical then the other alternative is cross-cluster replication (CCR) which allows you to replicate each index to the other clusters so that, for instance, the NY server has the authoritative copy of the NY phonebook but also has a local copy of the LA phonebook that tracks the authoritative copy that lives on the LA site.

In either case, if you are doing this across the internet you will want to protect your communications using TLS. Here are the docs about TLS and CCS, but I think they mostly also apply to CCR too: Cross cluster search and security | Elasticsearch Guide [6.6] | Elastic
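
To give a flavour of the CCR option, on the NY cluster you would create a follower for the LA phonebook roughly like this. CCR needs Elasticsearch 6.7 or later and a licence level that includes it, and the cluster alias and index names below are only placeholders:

curl -X PUT "localhost:9200/la-phonebook-copy/_ccr/follow" -H 'Content-Type: application/json' -d '
{
  "remote_cluster": "cluster_la",
  "leader_index": "la-phonebook"
}'

The follower index la-phonebook-copy then tracks changes to la-phonebook on the LA cluster, and searches on the NY cluster can include it like any other local index.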

Messy might be the wrong word, you are right. If CCS is needed only in one place, it is definitely manageable. If you need to manage the config on all 100+ independent ES servers, it is possibly a bit more challenging.


Currently I am considering three options:

  1. Send everything to some central place (the cloud) and then query from the cloud. Pros: fast results. Cons: every change is sent to the cloud even if no GUI is open, and since there are a lot of changes there is constant heavy traffic.
  2. Keep the data on local ES instances and configure Cross Cluster Search on the central ES node. Pros: no network traffic if no GUI is open. Cons: queries take more time as ES has to fetch the data from many remote nodes.
  3. Keep the data on local ES instances and configure each ES instance so that it has one replica on a central server (the cloud). Somehow configure the central node so that it queries the replicas. I do not know if this is even possible.

Could you suggest how to proceed?

The choice between options 1 and 2 is purely down to the economics of your use case. Option 1 (which you could implement with cross-cluster replication) involves more traffic but should deliver results quicker than option 2 (cross-cluster search). You'll probably have to do some benchmarking with a realistic workload to determine exactly what the costs and benefits are.

Option 3 is very much not recommended. Elasticsearch expects connections between nodes in a single cluster to be fairly reliable and to have fairly low latency, as if they were in a single datacenter, and this wouldn't be the case for that setup.

I managed to configure cross-cluster search successfully. Is there a convenient way to sort the results? Currently I get separate (sorted) results for each node. Am I supposed to merge the results manually? An example of a query is "get the first N people sorted by their names with phone numbers ending in 00".

If you search multiple indices then you should be getting one set of results, whether the indices you've searched are local or remote. Can you share the actual query you're sending, and an example of the results? I don't really know what you mean by "separated" results.

The setup:
One node, called 41 (its IP ends in 41), has fieldvasek : valuevasek41
Another node, called 12, has fieldvasek : valuevasek12
The two nodes are connected via the remote cluster configuration. Cluster names are "cluster41" and "cluster12".

I want to write a query that returns sorted results from both nodes (maybe my query is not correct, I admit).

The query:
curl -X GET "localhost:9200/cluster41:indexname,cluster12:indexname/_search" -d "{"query":{"wildcard":{"fieldvasek":"value*"}},"sort":{"fieldvasek":{"order":"asc"}}}" -H 'Content-Type: application/json'

The answer (reformatted):

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "skipped": 0,
    "failed": 0
  },
  "_clusters": {
    "total": 2,
    "successful": 2,
    "skipped": 0
  },
  "hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "cluster12:indexname",
        "_type": "typename",
        "_id": "optionalUniqueId",
        "_score": null,
        "_source": {
          "fieldvasek": "valuevasek12"
        },
        "sort": [
          "valuevasek12"
        ]
      },
      {
        "_index": "cluster41:indexname",
        "_type": "typename",
        "_id": "optionalUniqueId",
        "_score": null,
        "_source": {
          "fieldvasek": "valuevasek41"
        },
        "sort": [
          "valuevasek41"
        ]
      }
    ]
  }
}

I still don't see what you mean by "separated". There's just one list of hits.
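
To get just the first N hits sorted across all the clusters, add a size to the same kind of request. For your phonebook example ("first N people sorted by name with phone numbers ending in 00") it might look like the following, assuming name and phone are keyword fields; those field names are only an illustration:

curl -X GET "localhost:9200/cluster41:indexname,cluster12:indexname/_search" -H 'Content-Type: application/json' -d '
{
  "size": 10,
  "query": { "wildcard": { "phone": "*00" } },
  "sort": { "name": { "order": "asc" } }
}'

Elasticsearch merges and sorts the hits from every cluster before returning the single top-10 list.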

You are right doc :slight_smile: I misinterpreted the output... still learning


Given that I have tens of thousands of remote nodes, would the following be the right approach?

The task is to get the first N sorted results. Due to the large number of remote nodes it's not possible to use cross-cluster search (is that correct?)

  1. From a "central" node query remote nodes.
  2. The result received from each node store on the central ES node
  3. Run the query on the central ES node. In my opinion it should give the aggregated result across all the nodes.
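
Roughly, I imagine something like this, where the host names, the merged-results index name, and the use of jq to turn each response into a bulk request are only placeholders for the sake of the sketch:

# fetch the local top results from each remote node and index them into the central node
for host in es-node-001.example.com es-node-002.example.com; do
  curl -s "http://$host:9200/indexname/_search" \
       -H 'Content-Type: application/json' \
       -d '{"size":10,"query":{"wildcard":{"fieldvasek":"value*"}},"sort":{"fieldvasek":{"order":"asc"}}}' \
  | jq -c '.hits.hits[]._source | {index:{}}, .' \
  | curl -s "http://central-node:9200/merged-results/_bulk" \
         -H 'Content-Type: application/x-ndjson' --data-binary @-
done

# then run the final sorted query on the central node
curl "http://central-node:9200/merged-results/_search" -H 'Content-Type: application/json' \
     -d '{"size":10,"sort":{"fieldvasek":{"order":"asc"}}}'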

I don't think there's any limit on the number of remote clusters involved in a cross-cluster search, but it is true that CCS isn't really designed or tested with this many remote clusters. You might encounter difficulties.

It seems unlikely you need to run tens of thousands of separate Elasticsearch clusters. It's more common to use something lightweight like Beats to feed data from many sources into a smaller number of Elasticsearch clusters.

>>> I don't think there's any limit
Since the remote ES nodes are listed in the URL, their number is limited to a few hundred (I did not check exactly how many) because the URL is limited to ~2K characters. Is there any way to list the nodes in the request body?

>>> It's more common to use something lightweight like Beats
Thanks for pointing out Beats. I was not familiar with this technology and am checking it now. At first glance it's not suitable, since I need to be able to connect to each node locally and get results in that standalone node's context. Therefore, it seems that I have to have ES on each node.

I do not understand the reasoning behind this at all. Could you please explain the requirements driving this?

As Elasticsearch can be quite resource intensive, creating a few geographically distributed and highly available clusters that can be queried using CCS is the normal deployment pattern. Why would this not work for you?