FSCrawler does not do anything, does not index PDFs

This n00b is thoroughly confused.

I have several quite large PDF files that I want to be able to full-text search. They consist of raw text and images containing text.

So I want to index the text in the pdf, OCR the images and index the result as well.

To this end I have setup an Ubuntu box with Elasticsearch and Kibana, and setup FSCrawler.

I get the JSON result from Elasticsearch, and the Kibana website is functioning without issues.

I have put all PDFs in a folder and created an fscrawler job to index them.

The job runs, but it appears to be doing nothing at all. It just sits there.

In kibana I can see an index, but it appears to be empty.

Here is the fscrawler job:

---
name: "resumes"
fs:
  url: "/root/data/"
  update_rate: "1m"
  includes:
  - "*/*.pdf"
  json_support: true
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: true
  store_source: true
  index_content: true
  attributes_support: false
  raw_metadata: true
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: true
elasticsearch:
  nodes:
  - url: "http://192.168.1.126:8000"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: false
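
For reference, a quick way to confirm whether any documents actually made it into the index is to ask Elasticsearch directly. This is just a sketch; the node URL and index name below are the ones from the config above, so adjust them to your setup:

```shell
# Count documents in the "resumes" index (URL and index name taken from the config above)
curl -s "http://192.168.1.126:8000/resumes/_count?pretty"

# Peek at one indexed document, if any exist
curl -s "http://192.168.1.126:8000/resumes/_search?size=1&pretty"
```

If the crawler has indexed nothing, the first call returns `"count" : 0` even though the index itself exists (which matches what Kibana shows: an index, but empty).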

This is what I get when the job runs. After some time, more of the same:

 [f.console] ,----------------------------------------------------------------------------------------------------.
|       ,---,.  .--.--.     ,----..                                     ,--,           2.10-SNAPSHOT |
|     ,'  .' | /  /    '.  /   /   \                                  ,--.'|                         |
|   ,---.'   ||  :  /`. / |   :     :  __  ,-.                   .---.|  | :               __  ,-.   |
|   |   |   .';  |  |--`  .   |  ;. /,' ,'/ /|                  /. ./|:  : '             ,' ,'/ /|   |
|   :   :  :  |  :  ;_    .   ; /--` '  | |' | ,--.--.       .-'-. ' ||  ' |      ,---.  '  | |' |   |
|   :   |  |-, \  \    `. ;   | ;    |  |   ,'/       \     /___/ \: |'  | |     /     \ |  |   ,'   |
|   |   :  ;/|  `----.   \|   : |    '  :  / .--.  .-. | .-'.. '   ' .|  | :    /    /  |'  :  /     |
|   |   |   .'  __ \  \  |.   | '___ |  | '   \__\/: . ./___/ \:     ''  : |__ .    ' / ||  | '      |
|   '   :  '   /  /`--'  /'   ; : .'|;  : |   ," .--.; |.   \  ' .\   |  | '.'|'   ;   /|;  : |      |
|   |   |  |  '--'.     / '   | '/  :|  , ;  /  /  ,.  | \   \   ' \ |;  :    ;'   |  / ||  , ;      |
|   |   :  \    `--'---'  |   :    /  ---'  ;  :   .'   \ \   \  |--" |  ,   / |   :    | ---'       |
|   |   | ,'               \   \ .'         |  ,     .-./  \   \ |     ---`-'   \   \  /             |
|   `----'                  `---`            `--`---'       '---"                `----'              |
+----------------------------------------------------------------------------------------------------+
|                                        You know, for Files!                                        |
|                                     Made from France with Love                                     |
|                           Source: https://github.com/dadoonet/fscrawler/                           |
|                          Documentation: https://fscrawler.readthedocs.io/                          |
`----------------------------------------------------------------------------------------------------'

15:37:07,840 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [34.3mb/949.3mb=3.62%], RAM [194mb/3.8gb=4.94%], Swap [0b/0b=0.0].
15:37:09,229 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
15:37:09,234 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
15:37:11,898 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.0
15:37:12,380 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.0
15:37:13,156 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [resumes] for [/root/data/] every [1m]
15:37:13,318 WARN  [o.e.c.RestClient] request [POST http://192.168.1.126:8000/resumes/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 1 warnings: [299 Elasticsearch-7.17.0-bee86328705acaa9a6daede7140defd4d9ec56bd "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
15:37:13,365 WARN  [o.e.c.RestClient] request [POST http://192.168.1.126:8000/resumes_folder/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 1 warnings: [299 Elasticsearch-7.17.0-bee86328705acaa9a6daede7140defd4d9ec56bd "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]

I cannot seem to get more verbose output. I would have expected output like:

"found file input.pdf"
"scanning an indexing text"
"found image, applying ocr"
....

or something like that. What am I doing wrong?

I used the latest versions of everything.

I have since also installed Tesseract using apt (although the FSCrawler download should contain it) and Tika 2.2.1.

To no avail.

If you have already run the job before, that's expected: FSCrawler will only pick up modified or added documents.

You can use the --restart option to make sure it starts again from scratch.
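
Something like the following, assuming the job is named `resumes` and you run fscrawler from its install directory:

```shell
# Discard the saved job status and re-scan everything from scratch
bin/fscrawler resumes --restart

# FSCrawler also accepts --debug (and --trace) for the more verbose,
# per-file output you were looking for
bin/fscrawler resumes --debug
```

With `--debug` enabled you should see log lines for each file as it is picked up and indexed.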

facepalm self

Thanks for your answer. This turned out to be the solution. I assumed that since it hadn't actually indexed the documents yet, it would do so. It did not.

The --restart option fixed the issue. Should have found that myself.

In any case, thanks very much for taking the time to answer.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.