Indexing many pdf files

I want to index many pdf files. I read about the ingest attachment plugin and also researched examples online. One of them is Ingesting and Exploring Scientific Papers using Elastic Cloud. However, I have not yet found a tutorial that shows step by step how to index pdf files for a beginner. Does anyone know a good example of how to index pdf files? Currently, I am using the PyPDF2 python library to read pdf files and then index them with the elasticsearch python client. However, I noticed PyPDF2 does not read some of the pdf files properly, which is why I want to try the ingest attachment plugin. I am using the AWS Elasticsearch service and the ingest attachment plugin is already installed.
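For reference, here is a minimal sketch of what indexing one pdf through the ingest attachment pipeline can look like with the elasticsearch python client. It assumes an elasticsearch-py version matching a 6.x cluster (where doc_type is still used); the pipeline name, index name, file name and node URL are placeholders, so adjust them to your own setup.

```python
import base64

from elasticsearch import Elasticsearch

# Placeholder endpoint; point this at your own cluster.
es = Elasticsearch(["http://localhost:9200"])

# Ingest pipeline that runs the attachment processor on the
# base64-encoded file stored in the "data" field.
es.ingest.put_pipeline(
    id="pdf_attachment",
    body={
        "description": "Extract text from pdf files",
        "processors": [
            {"attachment": {"field": "data", "indexed_chars": -1}}
        ],
    },
)

# Read a pdf, base64-encode it and index it through the pipeline;
# the extracted text ends up in the "attachment.content" field.
with open("example.pdf", "rb") as f:
    data = base64.b64encode(f.read()).decode("utf-8")

es.index(
    index="pdfs",
    doc_type="_doc",
    id="example.pdf",
    pipeline="pdf_attachment",
    body={"data": data},
)
```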

Have a look also at the FSCrawler project, which might help.

BTW did you look at https://www.elastic.co/cloud and https://aws.amazon.com/marketplace/pp/B01N6YCISK ?

Cloud by elastic is the only way to have access to X-Pack. Think about what is already there, like Security, Monitoring and Reporting, and what is coming, like Canvas, SQL...


FSCrawler looks like a great solution for my problem. I put the pdf files in c:\tmp\es, but I am getting a "String index out of range: -1" error message. What could be the problem?

As you can see below, I have put many pdf files in the tmp\es folder.

Thank you
[screenshot of the pdf files in c:\tmp\es]

Can you try the latest snapshot?

I still get the same error message. It says "we found old configuration setting", as you can see from the screenshot below. I do not know what it means. Thanks.
[screenshot of the error message]

Please don't post images of text as they are hardly readable and not searchable.

Instead paste the text and format it with </> icon. Check the preview window.

Can you remove the .fscrawler dir? It should be in your home directory.
Then restart from scratch?

If it fails again, please activate debug mode and copy here the full logs.

I have removed the .fscrawler dir but I am still getting the same error message. Thanks.

 15:49:27,070 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
15:49:27,072 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
15:49:27,072 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
15:49:27,072 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
15:49:27,073 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
15:49:27,073 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
15:49:27,076 DEBUG [f.p.e.c.f.c.FsCrawler] Starting job [job_name]...
15:49:28,367 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch >= 5, so we can use ingest node feature
15:49:28,407 WARN  [f.p.e.c.f.c.FsCrawler] We found old configuration index settings in [.\test] or [.\test\job_name\_mappings]. You should look at the documentation about upgrades: https://github.com/dadoonet/fscrawler#upgrade-to-23
15:49:28,408 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
15:49:28,408 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
15:49:28,412 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] FS crawler connected to an elasticsearch [6.2.4] node.
15:49:28,413 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [job_name]
15:49:28,421 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [job_name_folder]
15:49:28,428 DEBUG [f.p.e.c.f.FsCrawlerImpl] creating fs crawler thread [job_name] for [/tmp/es] every [15m]
15:49:28,428 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [job_name] for [/tmp/es] every [15m]
15:49:28,429 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler thread [job_name] is now running. Run #1...
15:49:28,436 WARN  [f.p.e.c.f.FsCrawlerImpl] Error while crawling /tmp/es: String index out of range: -1
15:49:28,436 WARN  [f.p.e.c.f.FsCrawlerImpl] Full stacktrace
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
	at java.lang.String.substring(String.java:1967) ~[?:1.8.0_131]
	at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.indexDirectory(FsCrawlerImpl.java:701) ~[fscrawler-core-2.5-SNAPSHOT.jar:?]
	at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.run(FsCrawlerImpl.java:309) [fscrawler-core-2.5-SNAPSHOT.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
15:49:28,439 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler is going to sleep for 15m

Please format your code, logs or configuration files using the </> icon as explained in this guide and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

What does your fscrawler test job settings file look like?

It is shown below. I did not make any changes to it.

{
  "name" : "job_name",
  "fs" : {
    "url" : "/tmp/es",
    "update_rate" : "15m",
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}

Maybe change the dir name to /c:/tmp/es or something like that?

Changing the url to "C:\\tmp\\es" with double backslashes worked; it does not work with a single backslash. Thanks a lot, fscrawler is super cool! My final question is whether I can use it to index into a local Elasticsearch only, or whether I can also use it to index into AWS? Thank you
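In case it helps others, here is a sketch of the two relevant parts of the job settings file, based on the file shown above. The backslashes in a Windows path have to be escaped in JSON, and the elasticsearch section can point at a remote HTTPS endpoint instead of 127.0.0.1; the AWS hostname below is only a placeholder, and whether a hosted AWS Elasticsearch domain accepts the requests depends on how its access policy is configured.

```
{
  "fs" : {
    "url" : "C:\\tmp\\es"
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "my-domain.us-east-1.es.amazonaws.com",
      "port" : 443,
      "scheme" : "HTTPS"
    } ]
  }
}
```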

Here is a related issue

Can you run again with trace-level debug and share your full logs and settings here?

I'm away from keyboard for a week so I can't really look for details now.

Maybe this discussion can help as well:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.