Indexing SharePoint files (mounted as a network drive) using FSCrawler

Hi,

I have mapped a SharePoint site as a network drive on my Windows Server 2019 machine.


The path is W:\fsSharepointFiles
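For reference, mapping a SharePoint document library to a drive letter can be done along these lines; the site URL below is just a placeholder, not my real one. The mapping goes through the Windows WebClient (WebDAV) service:

    net use W: "https://<tenant>.sharepoint.com/sites/<site>/Shared Documents" /persistent:yes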
Then I installed Java and FSCrawler and started indexing these files. Below are the steps I followed:

    C:\Program Files\fscrawler-es7-2.7-SNAPSHOT>java -version
    java version "1.8.0_241"
    Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
    Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)

    C:\Program Files\fscrawler-es7-2.7-SNAPSHOT>set JAVA_HOME=c:\Program Files\Java\jdk1.8.0_241

    C:\Program Files\fscrawler-es7-2.7-SNAPSHOT>.\bin\fscrawler index_sharepoint
    03:20:44,652 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [10.3mb/247.5mb=4.19%], RAM [113.1mb/1023.6mb=11.06%], Swap [1.9gb/3.5gb=55.3%].
    03:20:44,715 WARN  [f.p.e.c.f.c.FsCrawlerCli] job [index_sharepoint] does not exist
    03:20:44,715 INFO  [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?
    y
    03:20:47,746 INFO  [f.p.e.c.f.c.FsCrawlerCli] Settings have been created in [C:\Users\Administrator\.fscrawler\index_sharepoint\_settings.yaml]. Please review and edit before relaunch

Then I edited the _settings.yaml. Below is my settings file:

---
name: "index_sharepoint"
fs:
  url: "W:\fsSharepointFiles"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://50.116.48.42:9200"
  password: devpass
  username: dev
  bulk_size: 5
  flush_interval: "5s"
  byte_size: "10mb"

Then I started the crawler again, and it gives me an error saying the path "W:\fsSharepointFiles" does not exist.

C:\Program Files\fscrawler-es7-2.7-SNAPSHOT>.\bin\fscrawler index_sharepoint
03:59:17,164 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [10.4mb/247.5mb=4.22%], RAM [87.3mb/1023.6mb=8.53%], Swap [1.9gb/3.5gb=53.64%].
03:59:19,508 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.5.1
03:59:19,664 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
03:59:19,664 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
03:59:20,789 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [index_sharepoint] for [W:sSharepointFiles] every [15m]
03:59:20,883 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling W:sSharepointFiles: W:sSharepointFiles doesn't exists.

Attaching an image, since the error contains a symbol that disappeared when I copy-pasted it here.

My mapping is working fine: when I add files on my SharePoint VM, they show up in the network drive.
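For example, a plain directory listing from a normal command prompt shows the files without any problem:

    C:\>dir W:\fsSharepointFiles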

Why is the crawler not able to find this directory? Did I miss something in the settings file?
Could you please point out what I am missing here?

And why does it keep waiting after the warning?
Is the "--loop 1" option from the Linux command the same on Windows too?
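From the FSCrawler docs it looks like --loop is handled by the crawler itself rather than by the shell, so the same option should work with the Windows launcher, e.g. to run a single scan and exit instead of watching the directory:

    .\bin\fscrawler index_sharepoint --loop 1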

-Lisa

I changed the path to '\\sharepoint VM IP\W$\fsSharepointFiles'
according to Pointing FSCrawler to a separate server for documents.
But it keeps saying the job does not exist and asking 'Y/N' to create the job.

    C:\Program Files\fscrawler-es7-2.7-SNAPSHOT>.\bin\fscrawler index_sharepoint
    05:05:17,404 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [10.4mb/247.5mb=4.23%], RAM [69.4mb/1023.6mb=6.78%], Swap [1.9gb/3.5gb=52.95%].
    05:05:17,982 WARN  [f.p.e.c.f.c.FsCrawlerCli] job [index_sharepoint] does not exist
    05:05:17,982 INFO  [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?
    y
    05:05:20,700 INFO  [f.p.e.c.f.c.FsCrawlerCli] Settings have been created in [C:\Users\Administrator\.fscrawler\index_sharepoint\_settings.yaml]. Please review and edit before relaunch        

I saved the settings file with the changes and started the job again:

    C:\Program Files\fscrawler-es7-2.7-SNAPSHOT>.\bin\fscrawler index_sharepoint
    05:07:06,483 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [10.4mb/247.5mb=4.23%], RAM [47mb/1023.6mb=4.59%], Swap [1.8gb/3.5gb=52.62%].
    05:07:07,170 WARN  [f.p.e.c.f.c.FsCrawlerCli] job [index_sharepoint] does not exist
    05:07:07,170 INFO  [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?
    y

It keeps asking the same question again and again. So I changed the path in the _settings.yaml file back to the original "W:\fsSharepointFiles".

So, back to my first question in this thread!

-Lisa

Could you use \\sharepoint VM IP\\W$\\fsSharepointFiles instead?

Hello,

The same error is coming up.

11:02:33,957 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling \52.163.14.180\W$\fsSharepointFiles: \52.163.14.180\W$\fsSharepointFiles doesn't exists.
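Looking at that log line, only a single leading backslash survived, so perhaps the UNC prefix itself needs to be escaped in the YAML as well; something like this (untested):

    fs:
      url: "\\\\52.163.14.180\\W$\\fsSharepointFiles"

which should be read back as \\52.163.14.180\W$\fsSharepointFiles. That might also explain the earlier "job does not exist" loop: \W is not a valid escape in a double-quoted YAML string, so the version with single backslashes probably failed to parse at all.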

-Lisa

Could you try

W:\\fsSharepointFiles

Or

\\\\W:\\fsSharepointFiles

Hi David,

I am so sorry... this worked: W:\\fsSharepointFiles
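That also explains the symbol that kept disappearing: in a double-quoted YAML string, \f is the form-feed escape, which is why the logs showed W:sSharepointFiles with the "f" swallowed. Doubling the backslash keeps it literal; single quotes or forward slashes should avoid the escaping entirely, though I have not tried those:

    fs:
      # "\f" inside double quotes is a form feed, so double the backslash:
      url: "W:\\fsSharepointFiles"
      # untested alternatives that avoid escape processing:
      # url: 'W:\fsSharepointFiles'
      # url: "W:/fsSharepointFiles"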
But there are warnings in the debug logs: "Failed to determine 'owner' of the file". Other than this, everything seems to be working. I searched for a few words through a search query in Kibana, and the output looks good.

    C:\Program Files\fscrawler-es7-2.7-SNAPSHOT>.\bin\fscrawler index_sharepoint --debug
    11:09:22,977 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [10.3mb/247.5mb=4.19%], RAM [126.6mb/1023.6mb=12.37%], Swap [1.7gb/3.5gb=49.46%].
    11:09:23,009 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
    11:09:23,009 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
    11:09:23,009 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
    11:09:23,009 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
    11:09:23,009 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [index_sharepoint]...
    11:09:23,493 DEBUG [f.p.e.c.f.c.ElasticsearchClientUtil] Trying to find a client version 7
    11:09:24,946 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.5.1
    11:09:25,071 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
    11:09:25,087 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
    11:09:25,118 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] FS crawler connected to an elasticsearch [7.5.1] node.
    11:09:25,118 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [index_sharepoint]
    11:09:25,993 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [index_sharepoint]
    11:09:26,055 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [index_sharepoint_folder]
    11:09:26,118 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [index_sharepoint_folder]
    11:09:26,165 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [index_sharepoint] for [W:\fsSharepointFiles] every [15m]
    11:09:26,180 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [index_sharepoint] for [W:\fsSharepointFiles] every [15m]
    11:09:26,180 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [index_sharepoint] is now running. Run #1...
    11:09:26,321 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles) = /
    11:09:26,321 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing index_sharepoint_folder/e6e39586a01b119482edbc6549b99d21?pipeline=null
    11:09:26,337 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [W:\fsSharepointFiles] content
    11:09:26,337 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from W:\fsSharepointFiles
    11:09:28,790 WARN  [f.p.e.c.f.f.FsCrawlerUtil] Failed to determine 'owner' of W:\fsSharepointFiles\fsSharepointfile2.txt: W:\fsSharepointFiles\fsSharepointfile2.txt: Incorrect function.

    11:09:28,837 WARN  [f.p.e.c.f.f.FsCrawlerUtil] Failed to determine 'owner' of W:\fsSharepointFiles\fsSharepointfile1.txt: W:\fsSharepointFiles\fsSharepointfile1.txt: Incorrect function.

    11:09:28,868 WARN  [f.p.e.c.f.f.FsCrawlerUtil] Failed to determine 'owner' of W:\fsSharepointFiles\fsSharepointfile4.txt: W:\fsSharepointFiles\fsSharepointfile4.txt: Incorrect function.

    11:09:28,915 WARN  [f.p.e.c.f.f.FsCrawlerUtil] Failed to determine 'owner' of W:\fsSharepointFiles\fsSharepointfile3.txt: W:\fsSharepointFiles\fsSharepointfile3.txt: Incorrect function.

    11:09:28,930 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 4 local files found
    11:09:28,930 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile2.txt) = /fsSharepointfile2.txt
    11:09:28,930 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/fsSharepointfile2.txt], includes = [null], excludes = [[*/~*]]
    11:09:28,930 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile2.txt], excludes = [[*/~*]]
    11:09:28,930 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile2.txt], includes = [null]
    11:09:28,930 DEBUG [f.p.e.c.f.FsParserAbstract] [/fsSharepointfile2.txt] can be indexed: [true]
    11:09:28,930 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /fsSharepointfile2.txt
    11:09:29,024 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [W:\fsSharepointFiles],[fsSharepointfile2.txt]
    11:09:29,024 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile2.txt) = /fsSharepointfile2.txt
    11:09:29,102 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
    11:09:29,134 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
    11:09:30,430 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
    See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
    for optional dependencies.

    11:09:31,024 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
    11:09:31,024 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
    11:09:31,587 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing index_sharepoint/7a14b8708c8b2ddceb2f3f3657e6889?pipeline=null
    11:09:31,587 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile1.txt) = /fsSharepointfile1.txt
    11:09:31,587 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/fsSharepointfile1.txt], includes = [null], excludes = [[*/~*]]
    11:09:31,587 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile1.txt], excludes = [[*/~*]]
    11:09:31,587 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile1.txt], includes = [null]
    11:09:31,587 DEBUG [f.p.e.c.f.FsParserAbstract] [/fsSharepointfile1.txt] can be indexed: [true]
    11:09:31,587 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /fsSharepointfile1.txt
    11:09:31,602 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [W:\fsSharepointFiles],[fsSharepointfile1.txt]
    11:09:31,633 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile1.txt) = /fsSharepointfile1.txt
    11:09:31,680 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing index_sharepoint/cde27657a17be8db852aecb17b97ad6d?pipeline=null
    11:09:31,696 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile4.txt) = /fsSharepointfile4.txt
    11:09:31,696 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/fsSharepointfile4.txt], includes = [null], excludes = [[*/~*]]
    11:09:31,696 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile4.txt], excludes = [[*/~*]]
    11:09:31,696 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile4.txt], includes = [null]
    11:09:31,696 DEBUG [f.p.e.c.f.FsParserAbstract] [/fsSharepointfile4.txt] can be indexed: [true]
    11:09:31,696 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /fsSharepointfile4.txt
    11:09:31,696 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [W:\fsSharepointFiles],[fsSharepointfile4.txt]
    11:09:31,696 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile4.txt) = /fsSharepointfile4.txt
    11:09:31,727 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing index_sharepoint/fb1515bfb746acca7b73a1b88ac52?pipeline=null
    11:09:31,727 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile3.txt) = /fsSharepointfile3.txt
    11:09:31,743 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/fsSharepointfile3.txt], includes = [null], excludes = [[*/~*]]
    11:09:31,743 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile3.txt], excludes = [[*/~*]]
    11:09:31,743 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile3.txt], includes = [null]
    11:09:31,743 DEBUG [f.p.e.c.f.FsParserAbstract] [/fsSharepointfile3.txt] can be indexed: [true]
    11:09:31,743 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /fsSharepointfile3.txt
    11:09:31,743 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [W:\fsSharepointFiles],[fsSharepointfile3.txt]
    11:09:31,743 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile3.txt) = /fsSharepointfile3.txt
    11:09:31,899 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing index_sharepoint/81e0219ddfdaa3c1641fcdb73f78d8?pipeline=null
    11:09:31,899 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [W:\fsSharepointFiles]...
    11:09:31,977 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [W:\fsSharepointFiles]...
    11:09:32,258 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m

Very interesting. If you don't mind, please share the logs in a new issue in FSCrawler. I'd like to have a better way to catch this error.
I guess this is because it's a network drive.

Sure, thank you so much!!

I am seeing the same error ("W:\fsSharepointFiles" does not exist) when I SSH to the EC2 Windows instance and run FSCrawler.

FSCrawler indexes files properly when I RDP to the EC2 Windows instance and run the crawler from the command prompt, as you have seen in this thread.

But when I SSH to the instance and repeat the same steps, I get the same "path does not exist" error.
I need to execute the whole process through a script instead of doing it manually; that's why I am trying it through SSH. My idea is to SSH to the instance and run the commands there. But I don't know why the same commands that work in the command prompt are not working in an SSH session.

administrator@EC2AMAZ-KPLF2CI C:\Program Files\fscrawler-es7-2.7-SNAPSHOT>.\bin\fscrawler index_sharepoint --debug
Unable to get Charset 'cp0' for property 'sun.stdout.encoding', using default UTF-8 and continuing.
15:27:48,488 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [9.9mb/247.5mb=4.02%], RAM [157mb/1023.6mb=15.34%], Swap [1.6gb/3.5gb=46.84%].
15:27:48,519 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
15:27:48,519 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
15:27:48,519 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
15:27:48,519 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
15:27:48,534 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [index_sharepoint]...
15:27:49,050 DEBUG [f.p.e.c.f.c.ElasticsearchClientUtil] Trying to find a client version 7
15:27:50,691 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.5.1
15:27:50,800 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
15:27:50,800 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
15:27:50,816 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] FS crawler connected to an elasticsearch [7.5.1] node.
15:27:50,816 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [index_sharepoint]
15:27:51,644 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [index_sharepoint]
15:27:51,706 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [index_sharepoint_folder]
15:27:51,769 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [index_sharepoint_folder]
15:27:51,847 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [index_sharepoint] for [W:\fsSharepointFiles] every [15m]
15:27:51,863 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [index_sharepoint] for [W:\fsSharepointFiles] every [15m]
15:27:51,863 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [index_sharepoint] is now running. Run #1...
15:27:51,863 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling W:\fsSharepointFiles: W:\fsSharepointFiles doesn't exists.
15:27:51,863 WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace
java.lang.RuntimeException: W:\fsSharepointFiles doesn't exists.
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:130) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_241]
15:27:51,878 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m

The path in the _settings.yaml is W:\\fsSharepointFiles.
\\\\W:\\fsSharepointFiles is also not working!

-Lisa

Is there any difference between running the crawler from the command prompt and from an SSH session on Windows?

When you SSH to the machine, can you run:

ls -l w:\\fsSharepointFiles

If not, it might mean that the directory is not visible under that name.
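Also note that on Windows, mapped drive letters belong to the logon session that created them, and an SSH session is a different logon session from your RDP desktop, so a drive mapped over RDP won't exist over SSH. You could try mapping it again inside the SSH session before starting the crawler; a sketch, reusing the IP you posted earlier (adjust the share and user to your setup):

    net use W: \\52.163.14.180\W$ /user:Administrator
    .\bin\fscrawler index_sharepoint --loop 1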

By the way, is your fsSharepointFiles drive accessible from SSH directly? I mean: is there an SSH server on the SharePoint server? In that case you could try to use SSH with FSCrawler.

I am not able to navigate to the W: drive:

administrator@EC2AMAZ-KPLF2CI C:\Windows>w:
The system cannot find the drive specified.

administrator@EC2AMAZ-KPLF2CI C:\Windows>

I did not install OpenSSH on the SharePoint VM. But even if I could SSH to the SharePoint VM, the files are in a site collection, which has a URL like "http:///Shared%20Documents".
This path won't work with FSCrawler, right?

So FSCrawler won't be able to see it either. No magic here 🙂

It won't. Crawling URLs is not supported.
But Workplace Search can do that I think.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.