Create custom source for Elastic Workplace Search

This is the terminal output:

jorge@ubuntu:~/Escritorio/FSCRAWLER/FSCrawlerWorkplace/fscrawler-es7-2.7-SNAPSHOT/bin$ ./fscrawler /home/jorge/Escritorio/FSCRAWLER/fscrawler/prueba --debug --restart

18:15:26,782 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [129.9mb/2.8gb=4.45%], RAM [4.1gb/11.7gb=35.27%], Swap [1.9gb/1.9gb=100.0%].
18:15:26,787 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
18:15:26,787 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
18:15:26,787 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
18:15:26,788 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
18:15:26,789 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [/home/jorge/Escritorio/FSCRAWLER/fscrawler/prueba]...
18:15:26,789 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [/home/jorge/Escritorio/FSCRAWLER/fscrawler/prueba]...

18:15:27,198 INFO  [f.p.e.c.f.c.FsCrawlerCli] Workplace Search integration is an experimental feature. As is it is not fully implemented and settings might change in the future.
18:15:27,199 WARN  [f.p.e.c.f.c.FsCrawlerCli] Workplace Search integration does not support yet watching a directory. It will be able to run only once and exit. We manually force from --loop -1 to --loop 1. If you want to remove this message next time, please start FSCrawler with --loop 1
18:15:27,201 DEBUG [f.p.e.c.f.c.ElasticsearchClientUtil] Trying to find a client version 7
18:15:27,208 DEBUG [f.p.e.c.f.c.WorkplaceSearchClientUtil] Trying to find a client version 7

18:15:27,219 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
18:15:28,025 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.9.2
18:15:28,261 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service started
18:15:28,263 DEBUG [f.p.e.c.f.t.w.WPSearchClient] Starting the WPSearchClient
18:15:28,319 DEBUG [f.p.e.c.f.c.ElasticsearchClientUtil] Trying to find a client version 7
18:15:28,338 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.9.2

18:15:28,344 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceWorkplaceSearchImpl] Workplace Search Document Service started

18:15:28,349 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [prueba] for [//home//jorge//Escritorio//FSCRAWLER//fscrawler//ficheros//ejemplo_esp.pdf] every [15m]
18:15:28,355 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [prueba] for [//home//jorge//Escritorio//FSCRAWLER//fscrawler//ficheros//ejemplo_esp.pdf] every [15m]
18:15:28,357 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [prueba] is now running. Run #1...
18:15:28,377 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [//home//jorge//Escritorio//FSCRAWLER//fscrawler//ficheros//ejemplo_esp.pdf] content
18:15:28,378 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from //home//jorge//Escritorio//FSCRAWLER//fscrawler//ficheros//ejemplo_esp.pdf
18:15:28,382 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Symlink on windows gives null for listFiles(). Skipping [//home//jorge//Escritorio//FSCRAWLER//fscrawler//ficheros//ejemplo_esp.pdf]
18:15:28,389 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 0 local files found
18:15:28,389 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [//home//jorge//Escritorio//FSCRAWLER//fscrawler//ficheros//ejemplo_esp.pdf]...
18:15:28,449 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [//home//jorge//Escritorio//FSCRAWLER//fscrawler//ficheros//ejemplo_esp.pdf]...
18:15:28,461 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
18:15:28,558 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [prueba]
18:15:28,559 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
18:15:28,559 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] Closing Elasticsearch client manager
18:15:28,561 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service stopped
18:15:28,562 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] Closing Elasticsearch client manager
18:15:28,563 DEBUG [f.p.e.c.f.t.w.WPSearchClient] Closing the WPSearchClient
18:15:28,563 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Closing BulkProcessor
18:15:28,563 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] BulkProcessor is now closed
18:15:28,563 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceWorkplaceSearchImpl] Workplace Search Document Service stopped
18:15:28,563 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
18:15:28,563 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [prueba] stopped
18:15:28,569 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [prueba]
18:15:28,570 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
18:15:28,570 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] Closing Elasticsearch client manager
18:15:28,571 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service stopped
18:15:28,571 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] Closing Elasticsearch client manager
18:15:28,571 DEBUG [f.p.e.c.f.t.w.WPSearchClient] Closing the WPSearchClient

18:15:28,571 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceWorkplaceSearchImpl] Workplace Search Document Service stopped
18:15:28,571 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped

I think that it is detecting Workplace Search but it is doing nothing with it.
The file that I used was indexed before the test, so it is not going to be indexed again.

This is my YAML:

---
name: "prueba"
fs:
  url: "//home//jorge//Escritorio//FSCRAWLER//fscrawler//ficheros"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  username: "elastic"
  password: "L3pfydSSgRtZxfg5gWmX"
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
workplace_search:
  access_token: "489e56799532ca13c49161f82093a41387fca45458617277705b5e8d0e250e77"
  key: "5f959a6e1d41c88afcdc280e"

I have just added the two lines at the end.

Thank you very much

So this is interesting:

18:15:28,382 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Symlink on windows gives null for listFiles(). Skipping [//home//jorge//Escritorio//FSCRAWLER//fscrawler//ficheros//ejemplo_esp.pdf]
18:15:28,389 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 0 local files found

Do you have a symbolic link to your files, or is it a copy of the files that you put in //home//jorge//Escritorio//FSCRAWLER//fscrawler//ficheros?

BTW I think that using:

fs:
  url: "/home/jorge/Escritorio/FSCRAWLER/fscrawler/ficheros"

Should work better. Could you try?

Hi,

Thank you for your answer.
I am going to try it!!

Thank you very much

Hello,

Sorry, I have had a lot of work these days.
I tried the fscrawler-es7-2.7-20201202.135628-144.zip version.

I have found that it works with Elasticsearch but not with Elastic Workplace Search.

I run the command

bin/fscrawler /home/jorge/Escritorio/Fscrawler/fuentes/prueba --restart --debug

It does not write anything, but the program indexes the different documents in Elasticsearch.

I've watched the ElasticOn program you broadcast, and I saw how you ran your version: you were able to index different documents in Elastic Workplace Search. I think I have a problem with my configuration.

name: "prueba"
fs:
  url: "/home/jorge/Escritorio/Fscrawler/ficheros"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  username: "elastic"
  password: "elastic"
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  
workplace_search:
  access_token: "46ebd34f99f08d0eefae3c95cf956c7be45b812c3e992fed1c01effbdcbbd498"
  key: "5fd8cd9e1d41c8910eeb63ee"
  server: "http://localhost:3002"

Thank you very much
Regards

Yes. Note that this is only in the master branch for now. See my previous comment:

Great!

Have a look at this advent calendar post: Dec 5th, 2020: [EN] Searching anything, anywhere with Workplace Search

It's basically a tutorial on how to run all that.

Hi,

I am very very happy because I was able to index some documents in Elastic Workplace Search.
Thank you very much.

I want to add a keyword whose content is the index (for looking up the different documents).
Is it possible?

The other question I have is: how can I find out the index of a document?

Thank you very very much

:partying_face:

I'm afraid I don't understand that question.

What do you mean by "the index"? The index name used by elasticsearch behind the scene?

Hi,

In the ElasticOn demo you added two keywords, queue1 and queue2.
How can I add these?

Sorry, I mean the ID of the document in Elastic Workplace Search (not the index, I am sorry).
For example, if I want to select a document in a search, I can use it.

Thank you very much

FSCrawler automatically extracts the metadata for you. The keyword metadata is sent to a keyword field. If your PDF or Office document has those keywords, they are extracted and sent to Workplace Search.

You don't add this information manually in Workplace Search or FSCrawler.

You can check if this is extracted by using the --debug or --trace mode of FSCrawler (I don't remember exactly which one produces that information; --debug should be enough). Then the logs should display the document that is about to be sent to Workplace Search.
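For example, reusing the job name from this thread (adjust to your own job):

bin/fscrawler prueba --trace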

If it's generated, then you can hopefully search for that as @Sean_Story did during the demo.

Hello,
I need to put a label on the document, so I need to change the metadata of a PDF, and I find it too difficult.
I do not know if there is a possibility of using a pipeline or something similar.
My first idea was to use the name of the folder, so if I name it with the topic I can use it as a keyword.
I need to add some extra information, but I do not know if it is possible.

Thank you very much

I believe that you would like something similar to https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#additional-tags?

There could be some options with Elasticsearch (like extracting the tag to add from the full path), but this is not possible with the Workplace Search output, as it does not support using an ingest pipeline.
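For the Elasticsearch output, a rough sketch of the idea in the Kibana console (the pipeline name tags-from-path and the grok pattern are only illustrative, and I'm assuming the default mapping where FSCrawler sends the virtual path of the file in path.virtual):

PUT _ingest/pipeline/tags-from-path
{
  "description": "Illustrative: copy the first folder of the virtual path into a tag field",
  "processors": [
    {
      "grok": {
        "field": "path.virtual",
        "patterns": ["/%{DATA:tag}/%{GREEDYDATA}"]
      }
    }
  ]
}

You would then reference that pipeline from the job settings with elasticsearch.pipeline. Again, this only works for the Elasticsearch output, not for Workplace Search.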

Would you like to open a feature request about this in the FSCrawler project? Maybe we can implement something in the future.

For sure, I don't see any option for that in the short term.

Hello,
I am looking for a way to add some external labels, but I cannot find the label template in FSCrawler (because I am not looking in the correct place, I think).

I want to add an external label to use for filtering the data, and I do not know if I can add it in the _settings.yaml file.

I am trying to use this file: ~/.fscrawler/datos/_settings.json, where datos is my index.

Could you explain to me how to access the template?

Thank you very much again

There's no such option in FSCrawler.
But that's something which could be added, in the same way Beats agents support adding a static field:value pair to each document.

See for example:

But note that this would be static.
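For reference, this is roughly how it looks in Filebeat's filebeat.yml (the field name and value are placeholders):

filebeat.inputs:
- type: log
  paths:
    - /var/log/*.log
  fields:
    department: legal
  fields_under_root: true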

Could you please open a feature request in FSCrawler project?

Hi,

I do not know if it is a good option to do it with an Elasticsearch pipeline, using a "set" processor.
What I do not know is how to use the pipeline on a whole index.

I want to modify all the docs adding the tag.
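Would something like this work, in the Kibana console? The pipeline name add-tag, the field tag and the value mytopic are just examples I made up:

PUT _ingest/pipeline/add-tag
{
  "description": "Set a static tag on every document",
  "processors": [
    {
      "set": {
        "field": "tag",
        "value": "mytopic"
      }
    }
  ]
}

POST datos/_update_by_query?pipeline=add-tag

I am not sure if _update_by_query with the pipeline parameter is the correct way to run it on every existing document of the index.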

Thank you very much

Not sure how this relates to my own answer.

Is there a question?

The question is whether I have to use Filebeat instead of a pipeline in Elasticsearch, because I have not used it before.

Is there a possibility to add a tag using a pipeline in Elasticsearch?

I understand that you index the document and later on you add the different tags, don't you?
So, I want to configure a pipeline for a whole index and not just for a single document.

Thank you very much

I just meant that I could add a feature similar to the one which exists in Filebeat. Not that you have to use Filebeat.

Just open a feature request and I'll see if I can work on that.

Ok Thank you very much

Hi,
I have read the Filebeat documentation.
I do not understand how I can connect it with FSCrawler, because I have all my documents in Elasticsearch already, and I find that Filebeat is a "previous step" that runs before indexing the data into Elasticsearch.
I find that it was designed for logs, and I do not know if I can use it for PDFs.
Of course, I have just started reading the documentation, and I am going to continue doing so.

Thank you very much

Hello,

I think that I have figured out how to add a tag.

First, I create a new field with the _mapping API.
Later on, I make an _update, and I add a value to the field.

I did it using the Kibana console.
One improvement could be to configure the label for the whole index.
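Roughly, this is what I did in the Kibana console (datos is my index; tag, mytopic and SOME_DOCUMENT_ID are placeholders):

PUT datos/_mapping
{
  "properties": {
    "tag": {
      "type": "keyword"
    }
  }
}

POST datos/_update/SOME_DOCUMENT_ID
{
  "doc": {
    "tag": "mytopic"
  }
}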

In the end, I am using just FSCrawler, Elasticsearch and Kibana.

Thank you very much for your help I am very grateful :grin::grin::grin: