FSCrawler Question

Hello,

I am trying to use FSCrawler and would like to know whether the following use case can be achieved with it.

  1. I already have an index/type created with my custom mappings.
  2. I am using the ingest-attachment plugin along with an ingest processor.

Question: Can I use FSCrawler to index PDF files into a specified index/type/doc and a specific field, using the configuration or the REST API?

Reason for doing this: I have very large documents that I would like to index, and my application runs in a Windows ecosystem (using the NEST client). Getting a base64 string out of large documents is giving me memory issues, so as an alternative I would like to check whether FSCrawler can pull the documents directly and index them into a specified index/type, against a specific document ID (in the specific field defined for the attachment type).

Let me know if this all makes sense.

Can I use FSCrawler to index PDF files into a specified index/type/doc and a specific field, using the configuration or the REST API?

Yes, although FSCrawler gives less flexibility over field names than ingest processors do. The good news is that FSCrawler can send each file it processes to Elasticsearch through an ingest pipeline, where you can simply rename a field to the target field you want. See the dadoonet/fscrawler repository on GitHub.

Is it what you are looking for?
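As a rough illustration of the rename step described above, here is a minimal Python sketch that builds the body of an ingest pipeline containing a single rename processor. It assumes FSCrawler puts the extracted text into a field named content; the target field my_attachment_text is hypothetical and stands in for whatever field the custom mapping defines. The script only prints the JSON body and does not contact Elasticsearch.

```python
import json


def build_rename_pipeline(source_field, target_field):
    """Return an ingest pipeline body that renames one field to another."""
    return {
        "description": "Rename an FSCrawler output field to a custom field",
        "processors": [
            {"rename": {"field": source_field, "target_field": target_field}}
        ],
    }


# Assumption: "content" is the field FSCrawler writes extracted text to.
# "my_attachment_text" is a hypothetical field name from the custom mapping.
pipeline = build_rename_pipeline("content", "my_attachment_text")

# This body would be sent with: PUT _ingest/pipeline/<pipeline name>
print(json.dumps(pipeline, indent=2))
```

The pipeline name used in the PUT request is the one FSCrawler would then reference through the "pipeline" setting in the elasticsearch section of _settings.json.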

Thanks @dadoonet for the quick response. I will try this out and post my findings by tomorrow.

Hello @dadoonet,

I tried the above suggestion, but the .bat doesn't do anything, i.e. FSCrawler doesn't start up at all. I am using Windows 7 and have an ES node running on my local system. The following error comes up on the console when I attempt to terminate the batch job:

D:\DevStuff\FsCrawler\bin>fscrawler --config_dir "D:\DevStuff\FsCrawler\attachtest_attach" attachtest_attach --loop 0 --rest
Exception in thread "main" java.util.NoSuchElementException
        at java.util.Scanner.throwFor(Scanner.java:862)
        at java.util.Scanner.next(Scanner.java:1371)
        at fr.pilato.elasticsearch.crawler.fs.FsCrawler.main(FsCrawler.java:212)

Terminate batch job (Y/N)? y

I verified the Java version as follows:

java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

The following is the _settings.json I have created:

{
  "name": "attachtest_attach",
  "fs": {
    "url": "D:\\Files_Indexing\\",
    "lang_detect": true,
    "indexed_chars": "-1"
  },
  "elasticsearch": {
    "index": "attachtest",
    "type": "attach",
    "pipeline": "attachment-pipeline",
    "nodes": [
      { "host": "127.0.0.1", "port": 9200, "scheme": "HTTP" }
    ]
  }
}

The above _settings.json is placed in the following location; I also specified the same location for the --config_dir parameter:

D:\DevStuff\FsCrawler\attachtest_attach\_settings

Could you please let me know if I am missing something obvious?

Can you try with either:

  • --config_dir "D:\\DevStuff\\FsCrawler\\attachtest_attach"
  • --config_dir "D:/DevStuff/FsCrawler/attachtest_attach"
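The same two path spellings matter inside a _settings.json file as well, because JSON treats the backslash as an escape character. This may be unrelated to the startup failure discussed here, but it is easy to rule out. The sketch below uses only Python's standard json module and a cut-down, hypothetical settings fragment to show which spellings of a Windows path parse as JSON.

```python
import json


def parses_as_json(text):
    """Return True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical fragments; only the path spelling differs between them.
single_backslashes = r'{"fs": {"url": "D:\Files_Indexing\"}}'
escaped_backslashes = r'{"fs": {"url": "D:\\Files_Indexing\\"}}'
forward_slashes = '{"fs": {"url": "D:/Files_Indexing/"}}'

for label, text in [
    ("single backslashes", single_backslashes),
    ("escaped backslashes", escaped_backslashes),
    ("forward slashes", forward_slashes),
]:
    print(label, "->", "valid" if parses_as_json(text) else "invalid")
```

Single backslashes are rejected because \F is not a JSON escape sequence, while the escaped and forward-slash forms both parse.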

I tried both, but no luck. The errors below are displayed when I terminate the .bat (Ctrl+C):

D:\DevStuff\FsCrawler\bin>fscrawler --config_dir "D:\\DevStuff\\FsCrawler\\attachtest_attach" attachtest_attach --loop 0 --rest
Exception in thread "main" java.util.NoSuchElementException
        at java.util.Scanner.throwFor(Scanner.java:862)
        at java.util.Scanner.next(Scanner.java:1371)
Terminate batch job (Y/N)? y

D:\DevStuff\FsCrawler\bin>fscrawler --config_dir "D:/DevStuff/FsCrawler/attachtest_attach" attachtest_attach --loop 0 --rest
Exception in thread "main" java.util.NoSuchElementException
        at java.util.Scanner.throwFor(Scanner.java:862)
        at java.util.Scanner.next(Scanner.java:1371)
Terminate batch job (Y/N)? y

Is there any way to generate logs for FSCrawler?

This was a bug on Windows. See https://github.com/dadoonet/fscrawler/issues/320#issuecomment-280726439

Thanks
