Indexing many pdf files

I want to index many pdf files. I read about the ingest attachment plugin and also researched examples online. One of them is Ingesting and Exploring Scientific Papers using Elastic Cloud. However, I have not yet found a tutorial that shows step by step how to index pdf files for a beginner. Does anyone know a good example of how to index pdf files? Currently, I am using the PyPDF2 python library to read pdf files and then index them with the elasticsearch python client. However, I noticed PyPDF2 does not read some of the pdf files properly, which is why I want to try the ingest attachment plugin. I am using the AWS Elasticsearch service and the ingest attachment plugin is already installed.
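For reference, here is a minimal sketch of what indexing one pdf through the ingest attachment pipeline can look like with the elasticsearch python client. It assumes an elasticsearch-py version matching a 6.x cluster (where doc_type is still used); the pipeline name, index name, file name and node URL are placeholders, so adjust them to your own setup.

```python
import base64

from elasticsearch import Elasticsearch

# Placeholder endpoint; point this at your own cluster.
es = Elasticsearch(["http://localhost:9200"])

# Ingest pipeline that runs the attachment processor on the
# base64-encoded file stored in the "data" field.
es.ingest.put_pipeline(
    id="pdf_attachment",
    body={
        "description": "Extract text from pdf files",
        "processors": [
            {"attachment": {"field": "data", "indexed_chars": -1}}
        ],
    },
)

# Read a pdf, base64-encode it and index it through the pipeline;
# the extracted text ends up in the "attachment.content" field.
with open("example.pdf", "rb") as f:
    data = base64.b64encode(f.read()).decode("utf-8")

es.index(
    index="pdfs",
    doc_type="_doc",
    id="example.pdf",
    pipeline="pdf_attachment",
    body={"data": data},
)
```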

Have a look also at the FSCrawler project, which might help.

BTW did you look at https://www.elastic.co/cloud and https://aws.amazon.com/marketplace/pp/B01N6YCISK ?

Cloud by elastic is the only way to have access to X-Pack. Think about what is already there, like Security, Monitoring and Reporting, and what is coming, like Canvas, SQL...


FSCrawler looks like a great solution for my problem. I put the pdf files in c:\tmp\es, but I am getting a "String index out of range: -1" error message. What could be the problem?

As you can see below, I have put many pdf files in the tmp\es folder.

Thank you
[screenshot of the pdf files in c:\tmp\es]

Can you try the latest snapshot?

I still get the same error message. It says "we found old configuration setting", as you can see from the screenshot below. I do not know what it means. Thanks.
[screenshot of the error message]

Please don't post images of text as they are hardly readable and not searchable.

Instead paste the text and format it with </> icon. Check the preview window.

Can you remove the .fscrawler dir? It should be in your home directory.
Then restart from scratch?

If it fails again, please activate debug mode and copy here the full logs.

I have removed the .fscrawler dir but I am still getting the same error message. Thanks.

 15:49:27,070 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
15:49:27,072 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
15:49:27,072 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
15:49:27,072 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
15:49:27,073 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
15:49:27,073 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
15:49:27,076 DEBUG [f.p.e.c.f.c.FsCrawler] Starting job [job_name]...
15:49:28,367 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch >= 5, so we can use ingest node feature
15:49:28,407 WARN  [f.p.e.c.f.c.FsCrawler] We found old configuration index settings in [.\test] or [.\test\job_name\_mappings]. You should look at the documentation about upgrades: https://github.com/dadoonet/fscrawler#upgrade-to-23
15:49:28,408 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
15:49:28,408 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
15:49:28,412 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] FS crawler connected to an elasticsearch [6.2.4] node.
15:49:28,413 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [job_name]
15:49:28,421 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [job_name_folder]
15:49:28,428 DEBUG [f.p.e.c.f.FsCrawlerImpl] creating fs crawler thread [job_name] for [/tmp/es] every [15m]
15:49:28,428 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [job_name] for [/tmp/es] every [15m]
15:49:28,429 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler thread [job_name] is now running. Run #1...
15:49:28,436 WARN  [f.p.e.c.f.FsCrawlerImpl] Error while crawling /tmp/es: String index out of range: -1
15:49:28,436 WARN  [f.p.e.c.f.FsCrawlerImpl] Full stacktrace
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
	at java.lang.String.substring(String.java:1967) ~[?:1.8.0_131]
	at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.indexDirectory(FsCrawlerImpl.java:701) ~[fscrawler-core-2.5-SNAPSHOT.jar:?]
	at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.run(FsCrawlerImpl.java:309) [fscrawler-core-2.5-SNAPSHOT.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
15:49:28,439 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler is going to sleep for 15m

Please format your code, logs or configuration files using the </> icon as explained in this guide and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

What does your fscrawler test job settings file look like?

It is shown below. I did not make any changes to it.

{
  "name" : "job_name",
  "fs" : {
    "url" : "/tmp/es",
    "update_rate" : "15m",
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}

Maybe change the dir name to /c:/tmp/es or something like that?

Changing the url to "C:\\tmp\\es" with double backslashes worked; it does not work with a single backslash. Thanks a lot, fscrawler is super cool! My final question is whether I can use it to index into a local Elasticsearch only, or whether I can also use it to index into AWS? Thank you
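In case it helps others, here is a sketch of the two relevant parts of the job settings file, based on the file shown above. The backslashes in a Windows path have to be escaped in JSON, and the elasticsearch section can point at a remote HTTPS endpoint instead of 127.0.0.1; the AWS hostname below is only a placeholder, and whether a hosted AWS Elasticsearch domain accepts the requests depends on how its access policy is configured.

```
{
  "fs" : {
    "url" : "C:\\tmp\\es"
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "my-domain.us-east-1.es.amazonaws.com",
      "port" : 443,
      "scheme" : "HTTPS"
    } ]
  }
}
```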

Here is a related issue

Can you run again with trace-level debug and share your full logs and settings here?

I'm away from keyboard for a week so I can't really look for details now.

Maybe this discussion can help as well:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.