Create a custom source in Elastic Workplace Search

I am using Workplace Search, and I now want to create a customised source.
Let me explain my project.
It consists of a search over documents (PDFs) that need to be labelled with their topic.
So I tried to download this program, but I do not know how it works.
On the other hand, I tried to use FSCrawler to ingest the PDF documents, but I was not able to do that either.

Which of these two options is better?
I am using Ubuntu 20.04.1 and Elastic version 7.9.1

Thank you very much

Hi @JorgeL-TI,

Sounds like a cool project.

For FSCrawler, remember that it is not an Elastic product, so this forum probably isn't the right place to seek support for it. However, I can tell you that Workplace Search support in FSCrawler is still unmerged. You can follow its progress here, and I am sure that the owner of the project would be happy to assist you with any issues you file.

The Ruby Client, which you've linked there, has instructions for usage in the README. For ingesting documents, you'd be particularly interested in this section. Is there a particular piece that you don't understand how to use?
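As a side note, the custom source documents API that the Ruby client wraps is plain HTTP, so it can also be exercised directly with curl. The following is a hedged sketch only: the `localhost:3002` address, the source key, and the document fields are all assumptions to be replaced with the values from your own custom source.

```shell
# Hedged sketch: index documents into a Workplace Search custom source over
# HTTP. HOST and SOURCE_KEY below are placeholders, not real values.
HOST="http://localhost:3002"
SOURCE_KEY="your_content_source_key"
ENDPOINT="$HOST/api/ws/v1/sources/$SOURCE_KEY/documents/bulk_create"

# Dry run: print the request instead of sending it, since no server is
# assumed to be running here.
echo "curl -X POST $ENDPOINT" \
     "-H 'Authorization: Bearer \$ACCESS_TOKEN'" \
     "-H 'Content-Type: application/json'" \
     "-d '[{\"id\":\"doc-1\",\"title\":\"My PDF\",\"body\":\"extracted text\"}]'"
```

The Authorization header carries the custom source access token, and the source key in the URL identifies which custom source receives the documents.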


On the other hand, I tried to use FSCrawler to ingest the PDF documents, but I was not able to do that either.

I'll be happy to help. Could you tell us more about what exactly you did?



I have just indexed some PDFs on Linux, and now I am going to configure FSCrawler in Kibana just to learn how to do it.
I had some problems (which I have just resolved) writing the path in the YAML file.
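For reference, a minimal sketch of an FSCrawler job settings file, with the path quoted so YAML parses it cleanly. The job name `resumes` and the paths below are examples based on this thread, not authoritative values; in FSCrawler 2.7 the file lives at `~/.fscrawler/<job_name>/_settings.yaml`.

```yaml
# ~/.fscrawler/resumes/_settings.yaml (hedged sketch -- values are examples)
name: "resumes"
fs:
  # A plain absolute path with single slashes; quoting avoids YAML
  # surprises with special characters.
  url: "/home/jorge/Escritorio/FSCRAWLER/Ficheros"
  update_rate: "1m"
elasticsearch:
  nodes:
  - url: "https://localhost:9200"
  username: "elastic"
  password: "changeme"
```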


Could you share your configuration, logs, etc.?



I am sharing my screenshot:

My difficulty here was configuring the YAML, because you have to erase the path.
But it is working with Elasticsearch, thank you very much.


I don't understand. Is it working?


Yes. Now I have to try it with Workplace Search.
Sorry if I did not explain myself properly.

Thank you

Is there anything I can help with?

Not right now, but I will keep in touch. Thank you very much,
and thanks again for solving my questions.

Hi David,
I have just installed IntelliJ and I am trying to clone the project (fscrawler-master), but I do not know how.

jorge@ubuntu:~/Escritorio/Scripts$ git clone
Cloning into 'fscrawler'...
Warning: Permanently added the RSA host key for IP address '*******' to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and that the repository exists.

So I cannot access it.
It seems that I do not have a public key set up.
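`Permission denied (publickey)` usually means the clone went over the SSH remote without an SSH key registered on GitHub. A hedged workaround sketch: use the HTTPS form of the URL instead, which needs no key for cloning public repositories. The repository URL below is an assumption based on the project name.

```shell
# Hedged sketch: convert a GitHub SSH remote URL to its HTTPS form, which
# does not require a registered public key for public repositories.
ssh_url="git@github.com:dadoonet/fscrawler.git"   # assumed repository URL
https_url="$(echo "$ssh_url" | sed -e 's#^git@github.com:#https://github.com/#')"
echo "$https_url"
# -> https://github.com/dadoonet/fscrawler.git
```

Then `git clone` that HTTPS URL, or point IntelliJ's "Get from VCS" dialog at it.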

Thank you very much

That's a problem with git and GitHub. I'm afraid I can't really help.

Out of curiosity why do you want to clone it?


What I wanted was to download it, and I have just done that.
I am running the different tests from this site:

I ran the Elasticsearch test and it worked:

mvn verify -pl fr.pilato.elasticsearch.crawler:fscrawler-it-v7 \
    -Dtests.cluster.user=elastic \
    -Dtests.cluster.pass=changeme \
    -Dtests.cluster.url=

and now I am doing the same with workplace search:

sudo mvn docker-compose:up waitfor:waitfor -pl fr.pilato.elasticsearch.crawler:fscrawler-it-v7

but it shows an error:

[ERROR] No plugin found for prefix 'docker-compose' in the current project and in the plugin groups [org.apache.maven.plugins, org.codehaus.mojo] available from the repositories [local (/root/.m2/repository), central (] -> [Help 1]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1]
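The `No plugin found for prefix 'docker-compose'` error means Maven could not map the short prefix to a plugin: prefixes only resolve from plugins declared in the project's own `pom.xml` or a configured plugin group, so the goal has to be run from inside the FSCrawler source checkout rather than from an unpacked distribution. As a hedged sketch of the general fallback, a plugin can always be invoked by its full coordinates instead of its prefix; the coordinates below are illustrative placeholders, not the actual plugin FSCrawler declares.

```shell
# Hedged sketch: build a fully-qualified Maven goal from its coordinates,
# which bypasses prefix resolution entirely. All names are placeholders.
group_id="com.example.plugins"
artifact_id="docker-compose-maven-plugin"
version="1.0.0"
goal="up"
full_goal="$group_id:$artifact_id:$version:$goal"
echo "mvn $full_goal -pl fr.pilato.elasticsearch.crawler:fscrawler-it-v7"
```

The real coordinates can be read from the `<plugin>` declaration in the project's `pom.xml`.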

and it did the same with this command:

sudo mvn verify -pl fr.pilato.elasticsearch.crawler:fscrawler-it-v7 \
    -Dtests.cluster.user=elastic \
    -Dtests.workplace.url=

I am not able to get the app working.

I figured out how to index some files in Elasticsearch, but now I want to index them in Workplace Search.

This is the settings file that I created using FSCrawler:

name: "resumes"
fs:
  url: "//home//jorge//Escritorio//FSCRAWLER//Ficheros"
  update_rate: "1m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  username: "elastic"
  password: "L3pfydSSgRtZxfg5gWmX"
  nodes:
  - url: ""
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
workplace_search:
  access_token: "489e56799532ca13c49161f82093a41387fca45458617277705b5e8d0e250e77"
  key: "5f959a6e1d41c88afcdc280e"
  nodes:
  - url: ""
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

But it still works with Elasticsearch, not with Workplace Search.

Secondly, I tried to use the program that you uploaded to the internet, but I am still having a problem with docker-compose.
It shows this error:

jorge@ubuntu:~/Escritorio/Proyectos/fscrawler/contrib/docker-compose-example$ docker-compose up
ERROR: Named volume "path_to_files_to_scan:/usr/app/data:ro" is used in service "fscrawler" but no declaration was found in the volumes section.

I read the file, but I am not sure whether I have to change any of the data in it.
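The error suggests the `volumes:` entry of the `fscrawler` service still contains the placeholder `path_to_files_to_scan`, which Compose then treats as a named volume that was never declared. A hedged sketch of the fix, assuming the placeholder is meant to become a real host path (the path below is only an example):

```yaml
# docker-compose.yml fragment (hedged sketch -- service and container paths
# are taken from the error message; the host path is an example placeholder)
services:
  fscrawler:
    volumes:
      # A bind mount: absolute host path on the left, container path on the
      # right. Replace the left side with the directory to scan.
      - /home/jorge/Escritorio/FSCRAWLER/Ficheros:/usr/app/data:ro
```

With an absolute path on the left, Compose treats the entry as a bind mount instead of looking for a declared named volume.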

Thank you very much again, I am finding great help here.

What URL did you use to download FSCrawler?

The version that is working I downloaded from here, and this is the YAML that I modified.

But I understand that to make FSCrawler work with Workplace Search I have to download the project, and that is where I have the problem with docker-compose.

Thank you

So this branch is not pushed to Sonatype. I don't think it can work with Workplace Search then.

The last build I shared is available at


Ok, thank you very much.

I am going to try it and I will share the results.

Thank you

This is the message that it shows:

jorge@ubuntu:~/Escritorio/FSCRAWLER/FSCrawler Workplace/fscrawler-es7-2.7-SNAPSHOT/bin$ ./fscrawler /home/jorge/Escritorio/FSCRAWLER/FSCrawler/resum
17:39:49,299 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [129.8mb/2.8gb=4.44%], RAM [4.4gb/11.7gb=37.55%], Swap [1.9gb/1.9gb=100.0%].
17:39:49,718 INFO  [f.p.e.c.f.c.FsCrawlerCli] Workplace Search integration is an experimental feature. As is it is not fully implemented and settings might change in the future.
17:39:49,719 WARN  [f.p.e.c.f.c.FsCrawlerCli] Workplace Search integration does not support yet watching a directory. It will be able to run only once and exit. We manually force from --loop -1 to --loop 1. If you want to remove this message next time, please start FSCrawler with --loop 1
17:39:49,740 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
17:39:50,471 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.9.2
17:39:50,872 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.9.2
17:39:50,891 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [resum] for [/home/jorge/Escritorio/FSCRAWLER/FSCrawler/Ficheros] every [1m]
17:39:51,015 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
17:39:51,096 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [resum] stopped

So, because the program is not able to upload a folder, I am going to try with a single file.

Thank you

It looks good. To make sure, start it with the --debug and --restart options.