Unable to send data using FSCrawler to ElasticSearch


(Sachin) #1

Hi,

I am a newbie to Elasticsearch, so please bear with me.

I am trying to index documents (DOC, PDF, PPT, etc.) using FSCrawler. When FSCrawler and Elasticsearch ran on the same machine, documents were indexed perfectly fine. Now I am trying to run FSCrawler on one machine and push data to Elasticsearch hosted on a different server, and I get the error message "ElasticsearchClientManager] failed to create elasticsearch client, disabling crawler...". This should ideally mean Elasticsearch is not running on the host specified in the _settings.json file, but I have double-checked that Elasticsearch is running on the remote server. Below is the content of my _settings.json file. Am I missing anything here?

    {
      "name" : "test_job",
      "fs" : {
        "url" : "d:\\docs",
        "update_rate" : "2m",
        "excludes" : [ "~*" ],
        "json_support" : false,
        "filename_as_id" : false,
        "add_filesize" : true,
        "remove_deleted" : false,
        "add_as_inner_object" : false,
        "store_source" : false,
        "index_content" : true,
        "attributes_support" : false,
        "raw_metadata" : true,
        "xml_support" : false,
        "index_folders" : true,
        "lang_detect" : false,
        "continue_on_error" : false,
        "pdf_ocr" : true,
        "ocr" : {
          "language" : "eng"
        }
      },
      "elasticsearch" : {
        "nodes" : [ {
          "host" : "",
          "port" : ####,
          "username" : "",
          "password" : "",
          "scheme" : "HTTPS"
        } ],
        "bulk_size" : 100,
        "flush_interval" : "5s"
      },
      "rest" : {
        "scheme" : "HTTP",
        "host" : "127.0.0.1",
        "port" : 8080,
        "endpoint" : "fscrawler"
      }
    }

Appreciate inputs on this.


(Sachin) #2

FYI: for security reasons, I have left the hostname and credentials out of the settings above.


(Sachin) #3

@dadoonet: I have been following a lot of discussion threads where you have given great suggestions related to FSCrawler, and with that help I was able to make good progress. I would appreciate your input on this issue.


(Sachin) #4

I think I found the issue: the Elasticsearch host I am trying to connect to accepts SSL connections only, so I need to supply a certificate file. But when I specify the CA certificate path in the _settings.json file, it does not seem to be parsed; I am probably not using the right parameter name. I am using the parameter name "cacert" : ""

Any ideas on the right way of doing this?
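
For readers hitting the same error, one way to confirm that the failure is a certificate trust problem rather than a network or settings problem is to attempt a verified TLS handshake from the FSCrawler machine. The following is a minimal Python sketch and is not part of FSCrawler; the function names `build_context` and `probe_tls`, and any host name or CA file path passed to them, are hypothetical placeholders.

```python
import socket
import ssl


def build_context(cafile=None):
    """Return a client-side TLS context that verifies the server certificate.

    With cafile=None the platform's default trusted CAs are used; otherwise
    only the CA certificates in the given PEM file are trusted.
    """
    return ssl.create_default_context(cafile=cafile)


def probe_tls(host, port, cafile=None, timeout=5.0):
    """Attempt a verified handshake and return (ok, detail)."""
    context = build_context(cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                # Handshake succeeded: report the negotiated protocol version.
                return True, tls.version()
    except ssl.SSLCertVerificationError as exc:
        # The server answered over TLS, but its chain could not be validated.
        return False, "certificate not trusted: %s" % exc.verify_message
    except (ssl.SSLError, OSError) as exc:
        # Other TLS failures, refused connections, timeouts, DNS errors.
        return False, "connection failed: %s" % exc
```

Calling `probe_tls("your-es-host", 9200, cafile="ca-chain.pem")` with your own host and CA file would show whether that chain validates the server. Python and Java keep separate trust stores, so a result here only indicates what kind of failure it is; it does not prove that the JVM running FSCrawler trusts the certificate.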


(Sachin) #5

I tried invoking FSCrawler as an API, but that also fails to start, complaining about SSL. Any idea which parameter should go into "_settings" if Elasticsearch is running on a remote server? Below is the debug output while invoking FSCrawler as an API:

D:\downloads\fscrawler-2.4\bin>fscrawler --config_dir d:\downloads\fscrawler-2.4\jobs ce-elk-doc-test-job --loop 0 --rest --debug

00:43:41,517 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings.json] already exists
00:43:41,533 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
00:43:41,533 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings.json] already exists
00:43:41,533 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
00:43:41,533 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [6/_settings.json] already exists
00:43:41,533 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
00:43:41,533 DEBUG [f.p.e.c.f.FsCrawler] Starting job [ce-elk-doc-test-job]...
00:43:41,642 INFO [f.p.e.c.f.FsCrawler] Password for engls_ce_elk:
jelly02fi$h
00:43:48,580 WARN [f.p.e.c.f.c.ElasticsearchClientManager] failed to create elasticsearch client, disabling crawler...
00:43:48,580 FATAL [f.p.e.c.f.FsCrawler] Fatal error received while running the crawler: [General SSLEngine problem]
00:43:48,580 DEBUG [f.p.e.c.f.FsCrawler] error caught
javax.net.ssl.SSLHandshakeException: General SSLEngine problem
at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1478) ~[?:1.8.0_131]
at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535) ~[?:1.8.0_131]
at sun.security.ssl.SSLEngineImpl.writeAppRecord(SSLEngineImpl.java:1214) ~[?:1.8.0_131]
at sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:1186) ~[?:1.8.0_131]
at javax.net.ssl.SSLEngine.wrap(SSLEngine.java:469) ~[?:1.8.0_131]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.doWrap(SSLIOSession.java:265) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.doHandshake(SSLIOSession.java:305) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.isAppInputReady(SSLIOSession.java:509) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:120) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) ~[?:1.8.0_131]
at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1728) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:304) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296) ~[?:1.8.0_131]
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514) ~[?:1.8.0_131]
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker$1.run(Handshaker.java:966) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker$1.run(Handshaker.java:963) ~[?:1.8.0_131]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416) ~[?:1.8.0_131]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.doRunTask(SSLIOSession.java:283) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.doHandshake(SSLIOSession.java:353) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
... 9 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387) ~[?:1.8.0_131]
at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:230) ~[?:1.8.0_131]
at sun.security.validator.Validator.validate(Validator.java:260) ~[?:1.8.0_131]
at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324) ~[?:1.8.0_131]
at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:281) ~[?:1.8.0_131]
at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136) ~[?:1.8.0_131]
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1501) ~[?:1.8.0_131]
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker$1.run(Handshaker.java:966) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker$1.run(Handshaker.java:963) ~[?:1.8.0_131]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416) ~[?:1.8.0_131]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.doRunTask(SSLIOSession.java:283) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.doHandshake(SSLIOSession.java:353) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target


(David Pilato) #6

Please format your code, logs or configuration files using the </> icon as explained in this guide, not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

This is the icon to use if you are not using markdown format: </>

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

Maybe you are using HTTPS here?


(Sachin) #10

Just to summarize the setup: I have installed FSCrawler on one server, which also holds the document repository that needs to be indexed. Elasticsearch runs on another set of servers (master/data nodes) that accept HTTPS connections. I am trying to run FSCrawler as a REST service.

Below is the _settings.json file:

    {
      "name" : "ce-elk-doc-test-job",
      "fs" : {
        "url" : "d:\\docs",
        "update_rate" : "2m",
        "excludes" : [ "~*" ],
        "json_support" : false,
        "filename_as_id" : false,
        "add_filesize" : true,
        "remove_deleted" : false,
        "add_as_inner_object" : false,
        "store_source" : false,
        "index_content" : true,
        "attributes_support" : false,
        "raw_metadata" : true,
        "xml_support" : false,
        "index_folders" : true,
        "lang_detect" : false,
        "continue_on_error" : false,
        "pdf_ocr" : true,
        "ocr" : {
          "language" : "eng"
        }
      },
      "elasticsearch" : {
        "nodes" : [
          { "host" : "elasticsearch hostname", "port" : 9200, "scheme" : "HTTPS" }
        ],
        "bulk_size" : 100,
        "flush_interval" : "5s",
        "username" : "engls_ce_elk"
      },
      "rest" : {
        "scheme" : "HTTP",
        "host" : "localhost",
        "port" : 8080,
        "endpoint" : "fscrawler"
      }
    }

And here is the DEBUG output:

22:25:27,620 FATAL [f.p.e.c.f.FsCrawler] Fatal error received while running the crawler: [General SSLEngine problem]
22:25:27,620 DEBUG [f.p.e.c.f.FsCrawler] error caught
javax.net.ssl.SSLHandshakeException: General SSLEngine problem
at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1478) ~[?:1.8.0_131]
at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535) ~[?:1.8.0_131]
at sun.security.ssl.SSLEngineImpl.writeAppRecord(SSLEngineImpl.java:1214) ~[?:1.8.0_131]
at sun.security.ssl.SSLEngineImpl.wrap(SSLEngineImpl.java:1186) ~[?:1.8.0_131]
at javax.net.ssl.SSLEngine.wrap(SSLEngine.java:469) ~[?:1.8.0_131]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.doWrap(SSLIOSession.java:265) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.doHandshake(SSLIOSession.java:305) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.isAppInputReady(SSLIOSession.java:509) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:120) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) ~[?:1.8.0_131]
at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1728) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:304) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296) ~[?:1.8.0_131]
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514) ~[?:1.8.0_131]
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker$1.run(Handshaker.java:966) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker$1.run(Handshaker.java:963) ~[?:1.8.0_131]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416) ~[?:1.8.0_131]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.doRunTask(SSLIOSession.java:283) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.doHandshake(SSLIOSession.java:353) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387) ~[?:1.8.0_131]
at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:230) ~[?:1.8.0_131]
at sun.security.validator.Validator.validate(Validator.java:260) ~[?:1.8.0_131]
at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324) ~[?:1.8.0_131]
at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:281) ~[?:1.8.0_131]
at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136) ~[?:1.8.0_131]
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1501) ~[?:1.8.0_131]
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker$1.run(Handshaker.java:966) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker$1.run(Handshaker.java:963) ~[?:1.8.0_131]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_131]
at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1416) ~[?:1.8.0_131]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.doRunTask(SSLIOSession.java:283) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
at org.elasticsearch.client.http.nio.reactor.ssl.SSLIOSession.doHandshake(SSLIOSession.java:353) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
Caused by: sun.security.provider.certpath.SunCertPathBuilderException:


(Sachin) #11

I was able to resolve the issue: I had to append the certificate chain to the cacerts truststore in the JAVA_HOME directory, and then I was able to connect over the HTTPS port.

Thanks
Sachin
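
For anyone repeating this fix, certificates are normally added to the JVM truststore with the JDK's `keytool`. The Python helper below is only a sketch that assembles such a command so the individual pieces are visible; it does not run anything. The function name, paths, and alias are hypothetical. The `jre\lib\security\cacerts` location assumes a Java 8 JDK layout (the traces above show 1.8.0_131), and `changeit` is the default truststore password, which may have been changed on your machine.

```python
import ntpath  # Windows-style path joining, usable on any OS


def keytool_import_command(java_home, cert_file, alias, storepass="changeit"):
    """Build the keytool arguments that add one CA certificate to cacerts."""
    # Assumed Java 8 JDK layout; a standalone JRE uses lib\security\cacerts.
    cacerts = ntpath.join(java_home, "jre", "lib", "security", "cacerts")
    keytool = ntpath.join(java_home, "bin", "keytool")
    return [
        keytool,
        "-importcert",
        "-trustcacerts",
        "-alias", alias,
        "-file", cert_file,
        "-keystore", cacerts,
        "-storepass", storepass,
    ]


command = keytool_import_command(
    r"C:\Program Files\Java\jdk1.8.0_131",  # hypothetical JAVA_HOME
    r"D:\certs\elasticsearch-ca.pem",       # hypothetical CA certificate file
    "elasticsearch-ca",                     # hypothetical alias
)
# Quote arguments containing spaces so the line can be pasted into a shell.
print(" ".join('"%s"' % part if " " in part else part for part in command))
```

Each CA certificate in the chain (root and any intermediates) would typically be imported under its own alias, and FSCrawler has to be restarted on that same Java installation for the change to take effect.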


(David Pilato) #12

Great news! Would you like to contribute a documentation PR to the FSCrawler project? That would help a lot of other people.


(Sachin) #13

Sure thing, but I am not sure of the process. Shall I just document the SSL-related steps and send them your way?


(David Pilato) #14

Yeah. Editing the README.md would be even better.


(Sachin) #15

Will do. Allow me a few days and I will get back to you on this.


(Sachin) #16

I have proposed changes to the README.md file to include SSL configuration steps. This is my first attempt to propose changes to an open source tool, and I hope I have done it the right way.

Let me know.


(David Pilato) #17

I don't see any PR you opened in https://github.com/dadoonet/fscrawler/pulls

I believe this is your commit in your branch:

So I created one PR for you.

I'll review it shortly and merge. Thanks!


(Sachin) #18

sounds good - thanks!!


(system) #19

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.