Plugin security policies not reliably being automatically loaded?

We have been having an interesting issue with the repository-hdfs plugin on our Elasticsearch cluster. It seems as though the plugin security policy is not "sticking" - that is the best way I can describe this confusing behavior.

What happens is this: we start the Elasticsearch cluster, then create a new repository:

curl -XPUT --data-binary @- localhost:9200/_snapshot/new_repo <<JSON
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://hdfs-host.example.com:8020/",
    "path": "/path/to/snapshot/dir"
    "conf_location": "/path/to/hdfs-site.xml"
  }
}
JSON

Then we attempt to verify it:

curl -XPOST localhost:9200/_snapshot/new_repo/_verify

The first time we run this on our node, we get an acknowledgement from Elasticsearch and the repository verification also succeeds. However, creating a snapshot subsequently fails with this error:

2017-07-12_19:00:49.25873 [2017-07-12T19:00:49,211][WARN ][r.suppressed             ] path: /_snapshot/testrepo/snapshot1, params: {repository=testrepo, wait_for_completion=true, snapshot=snapshot1}
2017-07-12_19:00:49.25876 org.elasticsearch.repositories.RepositoryException: [testrepo] could not read repository data from index blob
2017-07-12_19:00:49.25876 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:800) ~[elasticsearch-5.4.3.jar:5.4.3]
2017-07-12_19:00:49.25876 	at org.elasticsearch.snapshots.SnapshotsService.createSnapshot(SnapshotsService.java:236) ~[elasticsearch-5.4.3.jar:5.4.3]
2017-07-12_19:00:49.25876 	at org.elasticsearch.action.admin.cluster.snapshots.create.TransportCreateSnapshotAction.masterOperation(TransportCreateSnapshotAction.java:82) ~[elasticsearch-5.4.3.jar:5.4.3]
2017-07-12_19:00:49.25877 	at org.elasticsearch.action.admin.cluster.snapshots.create.TransportCreateSnapshotAction.masterOperation(TransportCreateSnapshotAction.java:41) ~[elasticsearch-5.4.3.jar:5.4.3]
2017-07-12_19:00:49.25877 	at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:87) ~[elasticsearch-5.4.3.jar:5.4.3]
2017-07-12_19:00:49.25878 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:166) ~[elasticsearch-5.4.3.jar:5.4.3]
2017-07-12_19:00:49.25878 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.3.jar:5.4.3]
2017-07-12_19:00:49.25878 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.3.jar:5.4.3]
2017-07-12_19:00:49.25878 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
2017-07-12_19:00:49.25879 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
2017-07-12_19:00:49.25879 	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
2017-07-12_19:00:49.25879 Caused by: java.io.IOException: com.google.protobuf.ServiceException: java.security.AccessControlException: access denied ("javax.security.auth.PrivateCredentialPermission" "org.apache.hadoop.security.Credentials" "read")
2017-07-12_19:00:49.25879 	at org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47) ~[?:?]
2017-07-12_19:00:49.25880 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:580) ~[?:?]
2017-07-12_19:00:49.25880 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
2017-07-12_19:00:49.25880 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
2017-07-12_19:00:49.25881 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
2017-07-12_19:00:49.25881 	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_112]
2017-07-12_19:00:49.25881 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) ~[?:?]
2017-07-12_19:00:49.25881 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) ~[?:?]
2017-07-12_19:00:49.25881 	at com.sun.proxy.$Proxy34.getListing(Unknown Source) ~[?:?]
2017-07-12_19:00:49.25882 	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2094) ~[?:?]
2017-07-12_19:00:49.25883 	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2077) ~[?:?]
2017-07-12_19:00:49.25883 	at org.apache.hadoop.fs.Hdfs.listStatus(Hdfs.java:254) ~[?:?]
2017-07-12_19:00:49.25883 	at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1798) ~[?:?]
2017-07-12_19:00:49.25884 	at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1794) ~[?:?]
2017-07-12_19:00:49.25884 	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) ~[?:?]
2017-07-12_19:00:49.25884 	at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1800) ~[?:?]
2017-07-12_19:00:49.25884 	at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1759) ~[?:?]
2017-07-12_19:00:49.25884 	at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1718) ~[?:?]
2017-07-12_19:00:49.25885 	at org.elasticsearch.repositories.hdfs.HdfsBlobContainer$6.run(HdfsBlobContainer.java:145) ~[?:?]
2017-07-12_19:00:49.25885 	at org.elasticsearch.repositories.hdfs.HdfsBlobContainer$6.run(HdfsBlobContainer.java:142) ~[?:?]
2017-07-12_19:00:49.25885 	at org.elasticsearch.repositories.hdfs.HdfsBlobStore.lambda$execute$0(HdfsBlobStore.java:132) ~[?:?]
2017-07-12_19:00:49.25885 	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_112]
2017-07-12_19:00:49.25885 	at java.security.AccessController.doPrivileged(AccessController.java:713) ~[?:1.8.0_112]
...
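
For reference, the snapshot request that triggers this is just a plain create-snapshot call. Reconstructed from the params in the warning above (the repository in this trace is named testrepo rather than new_repo), it is roughly:

curl -XPUT "localhost:9200/_snapshot/testrepo/snapshot1?wait_for_completion=true"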

Furthermore, if we restart Elasticsearch, repository verification fails with a similar security exception.

The only way I've found to fix these errors is to manually override where the Java process looks for the security policy file, via -Djava.security.policy=$ES_HOME/plugins/repository-hdfs/plugin-security.policy, which seems wrong.
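
For completeness, here is roughly how that override is applied today (a sketch only, assuming the tarball layout and a startup script that honors ES_JAVA_OPTS):

# Workaround sketch: point the JVM at the plugin's policy file explicitly.
# Note that a single '=' adds these grants on top of the default policy
# rather than replacing it.
ES_JAVA_OPTS="-Djava.security.policy=$ES_HOME/plugins/repository-hdfs/plugin-security.policy" \
  ./bin/elasticsearch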

From looking through the code, it appears that these policy files should be loaded and applied automatically when the plugin is used. I have also been unable to find other users experiencing the same error, which makes me think there is something peculiar about our particular setup or environment, but I haven't found anything that seems fishy.
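
As a sanity check, the shipped policy file can be inspected to confirm it contains a grant for the permission the AccessControlException complains about (path assumed from a standard install; adjust $ES_HOME as needed):

# Hypothetical check: does the plugin's bundled policy mention the denied permission?
grep -n "PrivateCredentialPermission" "$ES_HOME/plugins/repository-hdfs/plugin-security.policy"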

Has anybody else experienced a similar error? Are any of my assumptions about how the plugin architecture works incorrect? And, what else could I provide to help debug this particular error?

Thanks very much in advance.

I only looked briefly, but I wonder if this is related to https://github.com/elastic/elasticsearch/pull/23439/. What version of ES are you running?

@Jason_Anderson Have you seen this behavior with any other plugins, or is it only happening with the HDFS plugin? Are you executing against a secured HDFS deployment?

@s1monw That's what I had suspected at first as well, since I was on 5.3.0. I made sure to upgrade everything and I'm now running 5.4.3 on my test cluster, but I have the same issue as before. I am, however, testing against a development HDFS setup that does not have security enabled.

@james.baiera I haven't seen it with any other plugins, but I'm also not using any other plugins that I know of. So it could very well be affecting all plugins. As mentioned to Simon, I'm not hitting a secured HDFS deployment.
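
For what it's worth, the set of installed plugins can be double-checked with something like the following (assuming the standard tarball layout):

# Hypothetical double-check: list what is actually installed on the node
./bin/elasticsearch-plugin list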

The fact that it sometimes works is particularly odd, and leads me to believe there's something wonky with how the policy files themselves are loaded, or some idiosyncrasy with my current environment.

Update: one other thing that may be of note is that I'm running Elasticsearch out of /srv/elasticsearch, which is symlinked to /srv/elasticsearch-5.4.3. I have seen other issues reported in the past relating to symlink resolution, but the docs say a symlinked home or config path should be allowed, so I figured I was safe.
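
Concretely, the layout is along these lines (paths as described above):

# /srv/elasticsearch is a symlink to the versioned install directory
ln -s /srv/elasticsearch-5.4.3 /srv/elasticsearch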

@Jason_Anderson Could you include the full logs from the node that is failing? I'd also be interested if you are able to reproduce this with a different plugin or if it is just entirely an HDFS Plugin issue.

@Jason_Anderson I was just about to create a separate thread describing this issue when I came across this post. I am facing the exact same issue: it initially appeared on 5.3.0, and it still remains after upgrading to 5.4.2.

I have tried installing the repository-s3 plugin and have not run into any issues with its plugin security policy; it works as expected. Also, I am not running a secured HDFS setup, so the GitHub issue shouldn't apply to my setup either (unless this bug is not isolated to secured HDFS setups).

Is there anyone looking into this?

@james.baiera If you still need logs, I can provide them privately.

Thanks!

@james.baiera @Pacer_Girl, I put together a test case that reliably fails on my infrastructure. It creates a repository, verifies it, creates a snapshot, and then deletes the repository; that all typically works fine. But if I re-run the test, it reliably fails on snapshot creation. The sequence is roughly the one sketched below. I hope it helps diagnose the root cause.
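
In terms of raw API calls, the sequence is roughly the following (repository settings as in the original post; the repository and snapshot names are the ones from the log above):

# 1. Create the HDFS repository
curl -XPUT --data-binary @- localhost:9200/_snapshot/testrepo <<JSON
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://hdfs-host.example.com:8020/",
    "path": "/path/to/snapshot/dir",
    "conf_location": "/path/to/hdfs-site.xml"
  }
}
JSON

# 2. Verify it
curl -XPOST localhost:9200/_snapshot/testrepo/_verify

# 3. Create a snapshot
curl -XPUT "localhost:9200/_snapshot/testrepo/snapshot1?wait_for_completion=true"

# 4. Delete the repository
curl -XDELETE localhost:9200/_snapshot/testrepo

# Re-running steps 1-3 afterwards reliably fails at step 3 with the
# AccessControlException shown earlier.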

@Jason_Anderson Thanks for putting this together, hopefully we get some kind of feedback soon.

@Jason_Anderson Thanks for the reproduction steps, but I'm having a hard time getting this to reproduce locally on v5.4.3. Are you able to reproduce the issue with a locally installed instance instead of your deployment environment? That might help narrow down which factors are causing the problem. Could you also include the HDFS version you are trying to connect to?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.