Has anyone done much analysis on the readiness of ElasticSearch for the EU GDPR legislation coming into force next year?
As I understand it, GDPR mandates encryption at rest and encryption in transit. The former can be block level encryption on disk so that's easy. Encryption in transit is harder to retrofit to the host to host connections within an ElasticSearch cluster. Encryption of these links seems to only be available under Gold Shield licenses.
Is ElasticSearch intending to make this feature available in the free version?
It seems that not doing so would mean ElasticSearch could not be used for any system that might contain any sort of personal identifier (which includes user IP addresses, so would include Apache logs) - which precludes almost every use I can think of. That is, unless every installation is covered by a Gold license, which is likely to rule out ElasticSearch in most cases.
Clarity around this would be helpful, as would any suggestions for workarounds to introduce encryption to the server-to-server or client-server connections.
If I can play devil's advocate for a bit - if Apache is generating this data with PII in it, then how do you make sure it is compliant? Or if you don't use Elasticsearch and just store things in files?
For these sorts of legislative requirements, it really comes down as much to processes as to specific technological solutions.
If you are holding PII and you need to restrict access to that, then you may need the Security functionality in X-Pack or build your own out of other tools.
Please do. That was just an example, but let's continue with it for the sake of argument. The Apache servers would indeed need to store the logs in a way that meets the encrypted at rest definition (which is fairly loosely defined) and they would need to send to ELK using encryption in transit as well. That's feasible today (encrypted EBS volumes, Logstash SSL support) so I believe we can assume that's solved. However, when it hits ES it would potentially be transmitted to another node in the cluster in the clear.
Restricting access is definitely part of the GDPR as well, although that could be done by restricting access to the servers and endpoints in most circumstances so can be achieved by other means without needing the X-Pack.
The thing is that the GDPR redefines PII to be much broader (as I said, apache access logs would now be considered PII) and the GDPR applies to more organisations (processors as well as controllers). In theory the GDPR "applies to processing carried out by organisations operating within the EU. It also applies to organisations outside the EU that offer goods or services to individuals in the EU". Under a liberal reading that would mean any website that serves EU citizens needs to make sure that their logs are never stored or transmitted in the clear. In reality any organisation without a presence in the EU is unlikely to be pursued, but anyone that does can be fined up to 4% of global turnover.
Encryption is specifically mentioned and GDPR states that all PII should be encrypted in transit and at rest - as such it would seem to be a new sort of legislative requirement that does mandate a technological solution. Definitely open to other readings though.
Many use cases of the ElasticSearch ecosystem (I would say the vast majority) would fall under the GDPR. Without encryption being a standard or easily added free feature, ElasticSearch would seem not to be GDPR compliant and so would no longer be a candidate for projects.
Some clarity around Elastic's plans for GDPR compliance would be gratefully received so that users can, if necessary, start planning their migrations away from ES when there is no budget for support plans.
I think it's going to be tricky to get a statement on GDPR compliance/status because that effectively implies legal advice around a very tricky, complex legal topic, as you're obviously aware. This forum is open to the community and we Elastic employees are here to provide technical advice and knowledge (rather than legal/compliance knowledge/advice), so it may be better to speak to that and let you and your legal team take that and interpret it. If a lawyer is on here and wants to offer legal advice, then by all means.
For me, I can speak to how I've heard other customers deal with GDPR and what we have.
Some users I've heard from have said "it's against our policy to store PII data at all." They scrub data as it's getting parsed. There are things like the anonymize filter and the cipher filter, or you could simply remove sensitive fields in Logstash. I've heard of some users with a centralized logging infrastructure for their company going a step further by defining a schema that their shippers have to adhere to and using Elasticsearch strict mapping to prevent unwanted (and potentially PII-sensitive) data from showing up. Some go further still and inspect text data that's being shipped for things that look like PII, even if it's not in a field that's known to contain PII. I've heard of them doing this through various regular expressions, dictionary lookups, OpenNLP, and others.
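To make the scrubbing idea a bit more concrete, here is a minimal Python sketch of the regular-expression approach, run on a log line before it is shipped. It is not how the Logstash filters are implemented. The key, the patterns, and the function names (`SECRET_KEY`, `pseudonymize`, `scrub_line`) are all made up for illustration, and the patterns are deliberately simple, so they will miss some PII and match some things that are not PII.

```python
import hashlib
import hmac
import re

# Hypothetical key for illustration only; a real one would come from a secrets store.
SECRET_KEY = b"replace-with-a-real-secret"

# Deliberately simple patterns (assumptions, not a complete PII detector).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed hash of the value, so equal inputs still correlate."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def scrub_line(line: str) -> str:
    """Replace IPv4 addresses with a keyed hash and drop email-like strings."""
    line = IPV4_RE.sub(lambda m: pseudonymize(m.group(0)), line)
    line = EMAIL_RE.sub("[email removed]", line)
    return line


if __name__ == "__main__":
    sample = '203.0.113.7 - - "GET /?u=alice@example.com HTTP/1.1" 200'
    print(scrub_line(sample))
```

Note that a keyed hash is pseudonymization rather than anonymization: anyone holding the key can recompute the mapping, so whether the result still counts as personal data is a question for your legal team. Dropping the field entirely is the safer option if you don't need to correlate requests.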
There are various hardware and software components as well as some cloud/datacenter providers that can provide layer 2 / layer 3 encryption. There may be costs and/or labor to these, but I've talked to some organizations that have done this for years as their default deployment of any networked/clustered software.
Elastic provides X-Pack, which can easily provide encryption, as it sounds like you're aware. Of course, there is a cost to X-Pack as well.
We will re-evaluate what's in X-Pack and what's outside of X-Pack at various points, but my understanding (not to be taken as legal advice) is that it is possible to stand up a GDPR-compliant system without X-Pack, though it may require more work and/or have other costs and/or restrict the type of data you process or the flow of it.
Completely ignoring the Elastic Stack and just making a general statement: GDPR is obviously a big, complicated thing that is likely to cost organizations a lot of time and money to think and work through. It may not be possible for some organizations to successfully store PII, or the cost of doing so may be prohibitive, and in those cases the organization may choose to simply not store PII. Various systems including a variety of databases (including MySQL NDB clusters) don't provide encryption at all. Some others (Elasticsearch, MongoDB, and others) provide transport encryption in enterprise licenses. Organizations are going to have to review these software packages, scrubbing options for PII data (and the value of that PII data to them), encryption at various network layers and the costs/time to deploy those, and other workflows with their legal teams to evaluate the overall time, costs, and feasibility. I'm also sure various software providers will adjust and provide more proactive advice, architectures, and features over time.
Take a look at https://floragunn.com/ - they provide a free Elasticsearch plugin for "encryption, authentication, authorization, audit logging and multi tenancy". But it should be carefully checked and tested before you deploy it in production, as it is a third-party plugin with its own implementation and possibly its own bugs.