I can index PDF files into a local Elasticsearch instance using the Elasticsearch File System Crawler (FSCrawler). The default FSCrawler settings file contains the host, port, and scheme parameters as shown below.
{
  "name" : "job_name2",
  "fs" : {
    "url" : "/tmp/es",
    "update_rate" : "15m",
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}
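
Indexing into the local cluster works. For example, I can confirm that the crawled documents are searchable with a quick check like the rough sketch below (I'm assuming the index name defaults to the job name, job_name2, and that the extracted text lands in the "content" field; the search term is just a sample):

# Quick sanity check against the local cluster after a crawl.
# Assumes the index is named after the job ("job_name2") and that
# FSCrawler stored the extracted text in the "content" field.
import requests

resp = requests.get(
    "http://127.0.0.1:9200/job_name2/_search",
    params={"q": "content:invoice", "size": 5},  # "invoice" is just a sample term
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_source"].get("file", {}).get("filename"))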
However, I am having difficulty using it to index into the AWS Elasticsearch Service, because to index into AWS Elasticsearch I have to provide the AWS_ACCESS_KEY, AWS_SECRET_KEY, region, and service, as documented here. Any help on how to index PDF files into the AWS Elasticsearch Service would be highly appreciated.
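
For reference, this is roughly the kind of signed client I can build outside of FSCrawler from Python, using the elasticsearch-py and requests-aws4auth packages (a minimal sketch; the endpoint, region, and credential values are placeholders). What I can't see is how to plug this signing step into FSCrawler's elasticsearch settings:

# Minimal sketch of a SigV4-signed Elasticsearch client for the AWS
# Elasticsearch Service. All values below are placeholders.
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

AWS_ACCESS_KEY = "..."  # placeholder
AWS_SECRET_KEY = "..."  # placeholder
REGION = "us-east-1"    # placeholder
SERVICE = "es"

awsauth = AWS4Auth(AWS_ACCESS_KEY, AWS_SECRET_KEY, REGION, SERVICE)

es = Elasticsearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],  # placeholder endpoint
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)
print(es.info())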