Help a non-profit STIR the data

Hi, we're a Mexican NGO working across Latin America to stop corporate abuse. You can check out our bilingual website here: https://poderlatam.org

Our most data-intensive project is about public contracts; it even won a Sigma Awards data journalism prize (which we couldn't actually collect because of the pandemic, but still). We already have 5 million documents and counting. Our data ingestion engine is based on Apache NiFi and works quite well, that is, until we have to move the data from the processing cluster to the production cluster. On the production cluster we have a custom API that exposes the data under an open license for reuse, and that also serves our own apps.

We need a Secure Transactional Inter-cluster Replication (STIR) setup using Elastic, and we're only two developers working on this project (and many others!).

There are several challenges with this project right now. The main one is moving the data securely; it's about 20 GB per week. We would like to set up authentication in our clusters (for added Security) and also to allow for Inter-cluster replication. It's important to note that we don't want to replicate data automatically from the processing cluster to the production cluster, because we need to check for errors first; that is what we mean by Transactional.

We tried and failed to set up security in our Kubernetes-based cluster following the official docs. And since we don't want to expose the unsecured Elasticsearch nodes to the internet, our current process is a manual dump and import of the data. To avoid downtime, we're thinking of keeping two copies of each index in production and using index aliases to switch from one to the other after the import is done.
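For what it's worth, that alias switch can be done atomically with the `_aliases` endpoint. A minimal sketch in Python, where the alias and index names (`contracts`, `contracts_v1`, `contracts_v2`) are hypothetical placeholders:

```python
import json

def alias_swap_body(alias, old_index, new_index):
    """Build the _aliases request body that atomically repoints an
    alias from the old index to the freshly imported one.
    POST it to /_aliases on the production cluster."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Hypothetical names -- substitute your own.
body = alias_swap_body("contracts", "contracts_v1", "contracts_v2")
print(json.dumps(body, indent=2))
```

Because both actions are applied in a single atomic call, searches against the alias never see an empty or half-imported index.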

Any help in improving this setup would be greatly appreciated.

Please note that our 2020 budget is long gone (otherwise we would hire the same consultants that helped us create the cluster in the first place), but maybe next year, if we continue to work together, we can talk business.

Thanks in advance, Martín and Fernando from PODER.

Hi, what license and version are you currently on?

Hello willemdh, thanks for the question. We're running 7.8 on a basic license.

Just a thought: you're talking about inter-cluster replication, but isn't cross-cluster replication what you mean? Unfortunately that is not available on the Basic license, only on Platinum and above.

Thanks for the info. We tried to ask for a discount last year on the license, but it was still more than we could pay.

But even if we don't have the native replication feature, we still need to do it somehow. We have also been partly successful in replicating the Graph API using terms aggregations.
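For context, our terms-aggregation approximation of a Graph-style "connections" query looks roughly like the sketch below. The field names (`supplier.keyword`, `buyer.keyword`) are hypothetical; adjust to your own mapping:

```python
import json

def graph_like_connections(seed_field, seed_value, connected_field, size=10):
    """Approximate one Graph 'connections' step: filter on the seed
    entity, then aggregate the most frequent values of the field it
    connects to."""
    return {
        "size": 0,
        "query": {"term": {seed_field: seed_value}},
        "aggs": {
            "connections": {
                "terms": {"field": connected_field, "size": size}
            }
        },
    }

# Hypothetical entity and fields -- substitute your own.
body = graph_like_connections("supplier.keyword", "ACME SA", "buyer.keyword")
print(json.dumps(body, indent=2))
```

One caveat: Graph ranks vertices by significance rather than raw frequency, so a `significant_terms` aggregation is usually a closer match than plain `terms`.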

I have some specific questions then:

  1. Can I instruct the copying of an index from another cluster in a way that is allowed by my license?
  2. Or is the only way to copy a full dump and restore from outside the cluster?
  3. Or is there an internal way to see the most recently modified documents and copy just those?

Security is still an issue, though. We spent a couple of weeks trying to create the certificates and always came up short. Do you know of any material that specifically addresses running a secured Elasticsearch cluster on Kubernetes?

We're on Kubernetes 1.14, planning to upgrade to 1.19. Would there be any difference between those?

If you can provide specific details about the problems you ran into then we can probably help you work through them.

You have two viable options:

  1. Reindex from remote
  2. Snapshot and restore
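For option 1, a reindex-from-remote request body can be sketched as below. The host, credentials, and index names are hypothetical; note that the remote host must also be listed in `reindex.remote.whitelist` in `elasticsearch.yml` on the destination cluster. The optional range query also addresses copying only recently modified documents, assuming your documents carry a `last_modified` date field:

```python
import json

# Hypothetical host and credentials -- replace with your own.
PROCESSING_HOST = "https://processing-cluster:9200"

def reindex_from_remote_body(source_index, dest_index,
                             username, password, since=None):
    """Build a _reindex request body that pulls an index from a
    remote cluster. POST it to /_reindex on the destination cluster."""
    source = {
        "remote": {
            "host": PROCESSING_HOST,
            "username": username,
            "password": password,
        },
        "index": source_index,
    }
    if since is not None:
        # Incremental copy: only documents modified since `since`,
        # assuming a 'last_modified' date field in the mapping.
        source["query"] = {"range": {"last_modified": {"gte": since}}}
    return {"source": source, "dest": {"index": dest_index}}

body = reindex_from_remote_body("contracts_staging", "contracts_v2",
                                "replicator", "secret",
                                since="2020-11-01")
print(json.dumps(body, indent=2))
```

Reindex from remote works on a Basic license and leaves you in control of when the copy happens, which fits the "check for errors before promoting" workflow.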

Elastic Cloud on Kubernetes is our official k8s operator for running the Elastic Stack. The clusters that it provisions always have security enabled, and it's available on a basic license.
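ECK is provider-agnostic: it runs on any conformant Kubernetes distribution, DigitalOcean included. A minimal manifest sketch (the cluster name, node count, and storage size are assumptions; `do-block-storage` is DigitalOcean's block-storage class):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: production
spec:
  version: 7.8.0
  nodeSets:
  - name: default
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
        storageClassName: do-block-storage
```

The operator generates the TLS certificates and the `elastic` user's password for you and stores them as Kubernetes secrets.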

Hi Tim! Fernando from PODER here.
In response to your first question, we tried to follow the official documentation step by step:

https://www.elastic.co/guide/en/elasticsearch/reference/current/configuring-security.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/configuring-tls-docker.html

We got a bit stuck trying to understand the proper way to generate a certificate for each node and then assign them. First we tried using the same certificate for all nodes, then we manually generated a different certificate for each node and assigned them individually. Eventually we were able to generate certificates with a script that runs on each deploy and stores them as a secret, but the nodes still weren't able to talk to each other securely. No matter which path we took to configure security, the error we always got was: "WARN: received plaintext http traffic on an https channel, closing connection".
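That warning usually means some client (Kibana, a readiness probe, a plain `curl http://...`) is still speaking unencrypted HTTP to a port that now serves HTTPS, rather than a problem with the certificates themselves. For reference, a minimal `elasticsearch.yml` fragment for transport and HTTP TLS, assuming a PKCS#12 bundle generated with `elasticsearch-certutil` and mounted under `config/certs/` (the path is an assumption):

```yaml
# Minimal TLS settings for node-to-node (transport) and client (http) traffic.
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/elastic-certificates.p12
```

Once `xpack.security.http.ssl.enabled` is true, every client and health check must switch its URLs to `https://` or you will keep seeing that warning.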

Our Kubernetes cluster has 3 Elasticsearch nodes, is deployed using Jenkins, and runs on DigitalOcean. This provider doesn't seem to be supported by ECK, can you confirm? We would rather not switch cloud providers.

I can provide any more details that you might need. Thank you very much!

Hello, there's an update on this situation. We have now been able to encrypt communications between the nodes of one of our clusters; this was done by creating the configuration in the Dockerfile before deployment.
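For anyone following along, the approach described above amounts to something like this sketch (the image version and file path are assumptions; the certificates themselves can still be mounted from a Kubernetes secret at runtime):

```dockerfile
# Bake the security settings into the image at build time.
FROM docker.elastic.co/elasticsearch/elasticsearch:7.8.0

# elasticsearch.yml carries the xpack.security.* TLS settings;
# certificates are mounted separately under config/certs at runtime.
COPY elasticsearch.yml /usr/share/elasticsearch/config/elasticsearch.yml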

We have also been able to set up aliases that allow us to switch from one index to another while the data is being updated.

We have resolved the S part of STIR. The rest of the needs still stand: we need a way to Replicate data Transactionally from one cluster to another without manually downloading and re-uploading it each time.

Thanks for your help.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.