FSCrawler does not do anything, does not index PDFs

This n00b is thoroughly confused.

I have several quite large PDF files that I want to be able to full-text search. They consist of raw text and images containing text.

So I want to index the text in the pdf, OCR the images and index the result as well.

To this end I have setup an Ubuntu box with Elasticsearch and Kibana, and setup FSCrawler.

I get the JSON result from Elasticsearch, and the Kibana website is functioning without issues.

I have put all PDFs in a folder and created an fscrawler job to index them.

The job runs, but it appears to be doing nothing at all. It just sits there.

In kibana I can see an index, but it appears to be empty.

Here is the fscrawler job:

---
name: "resumes"
fs:
  url: "/root/data/"
  update_rate: "1m"
  includes:
  - "*/*.pdf"
  json_support: true
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: true
  store_source: true
  index_content: true
  attributes_support: false
  raw_metadata: true
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: true
elasticsearch:
  nodes:
  - url: "http://192.168.1.126:8000"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: false
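
For reference, a quick way to confirm whether any documents actually made it into the index is to ask Elasticsearch directly. This is just a sketch; the node URL and index name below are the ones from the config above, so adjust them to your setup:

```shell
# Count documents in the "resumes" index (URL and index name taken from the config above)
curl -s "http://192.168.1.126:8000/resumes/_count?pretty"

# Peek at one indexed document, if any exist
curl -s "http://192.168.1.126:8000/resumes/_search?size=1&pretty"
```

If the crawler has indexed nothing, the first call returns `"count" : 0` even though the index itself exists (which matches what Kibana shows: an index, but empty).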

This is what I get when the job runs. After some time, more of the same:

 [f.console] ,----------------------------------------------------------------------------------------------------.
|       ,---,.  .--.--.     ,----..                                     ,--,           2.10-SNAPSHOT |
|     ,'  .' | /  /    '.  /   /   \                                  ,--.'|                         |
|   ,---.'   ||  :  /`. / |   :     :  __  ,-.                   .---.|  | :               __  ,-.   |
|   |   |   .';  |  |--`  .   |  ;. /,' ,'/ /|                  /. ./|:  : '             ,' ,'/ /|   |
|   :   :  :  |  :  ;_    .   ; /--` '  | |' | ,--.--.       .-'-. ' ||  ' |      ,---.  '  | |' |   |
|   :   |  |-, \  \    `. ;   | ;    |  |   ,'/       \     /___/ \: |'  | |     /     \ |  |   ,'   |
|   |   :  ;/|  `----.   \|   : |    '  :  / .--.  .-. | .-'.. '   ' .|  | :    /    /  |'  :  /     |
|   |   |   .'  __ \  \  |.   | '___ |  | '   \__\/: . ./___/ \:     ''  : |__ .    ' / ||  | '      |
|   '   :  '   /  /`--'  /'   ; : .'|;  : |   ," .--.; |.   \  ' .\   |  | '.'|'   ;   /|;  : |      |
|   |   |  |  '--'.     / '   | '/  :|  , ;  /  /  ,.  | \   \   ' \ |;  :    ;'   |  / ||  , ;      |
|   |   :  \    `--'---'  |   :    /  ---'  ;  :   .'   \ \   \  |--" |  ,   / |   :    | ---'       |
|   |   | ,'               \   \ .'         |  ,     .-./  \   \ |     ---`-'   \   \  /             |
|   `----'                  `---`            `--`---'       '---"                `----'              |
+----------------------------------------------------------------------------------------------------+
|                                        You know, for Files!                                        |
|                                     Made from France with Love                                     |
|                           Source: https://github.com/dadoonet/fscrawler/                           |
|                          Documentation: https://fscrawler.readthedocs.io/                          |
`----------------------------------------------------------------------------------------------------'

15:37:07,840 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [34.3mb/949.3mb=3.62%], RAM [194mb/3.8gb=4.94%], Swap [0b/0b=0.0].
15:37:09,229 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
15:37:09,234 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
15:37:11,898 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.0
15:37:12,380 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.0
15:37:13,156 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [resumes] for [/root/data/] every [1m]
15:37:13,318 WARN  [o.e.c.RestClient] request [POST http://192.168.1.126:8000/resumes/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 1 warnings: [299 Elasticsearch-7.17.0-bee86328705acaa9a6daede7140defd4d9ec56bd "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
15:37:13,365 WARN  [o.e.c.RestClient] request [POST http://192.168.1.126:8000/resumes_folder/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 1 warnings: [299 Elasticsearch-7.17.0-bee86328705acaa9a6daede7140defd4d9ec56bd "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]

I cannot seem to get more verbose output. I would have expected output like:

"found file input.pdf"
"scanning an indexing text"
"found image, applying ocr"
....

or something like that. What am I doing wrong?

I used the latest versions of everything.

I have since also installed Tesseract using apt (although the FSCrawler download should contain it) and Tika 2.2.1.

To no avail.

If you have already run the job before, that's expected: FSCrawler will only pick up modified or added documents.

You can use the --restart option to make sure it starts again from scratch.
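
Something like the following, assuming the job is named `resumes` and you run fscrawler from its install directory:

```shell
# Discard the saved job status and re-scan everything from scratch
bin/fscrawler resumes --restart

# FSCrawler also accepts --debug (and --trace) for the more verbose,
# per-file output you were looking for
bin/fscrawler resumes --debug
```

With `--debug` enabled you should see log lines for each file as it is picked up and indexed.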

facepalm self

Thanks for your answer. This turned out to be the solution. I assumed that since it hadn't actually indexed the documents yet, it would do so. It did not.

The --restart option fixed the issue. Should have found that myself.

In any case, thanks very much for taking the time to answer.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.