Indexing Word, PDF documents?

Can someone please point me to step-by-step documentation for indexing a Word or PDF document in Elasticsearch?

I have gone through a couple of posts on this and came across FSCrawler, among other options.

I would like to know if there is official documentation on this topic.

You can use the ingest attachment plugin.

There's an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

```
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id
```

The `data` field is simply the Base64 representation of your binary file.
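
In case it helps, here is a rough sketch of how you could do that from Python. The file name, index name, and host are placeholders, and it assumes the ingest-attachment plugin is installed and the `attachment` pipeline above has already been created:

```
import base64
import requests  # third-party: pip install requests

ES_HOST = "http://localhost:9200"  # assumed local cluster

# The attachment processor expects the binary content
# Base64-encoded in the "data" field.
with open("report.pdf", "rb") as f:  # hypothetical file
    encoded = base64.b64encode(f.read()).decode("ascii")

# Index the document through the "attachment" pipeline defined above.
resp = requests.put(
    f"{ES_HOST}/my_index/_doc/report-1",
    params={"pipeline": "attachment"},
    json={"data": encoded},
)
resp.raise_for_status()
print(resp.json())
```

Afterwards, `GET my_index/_doc/report-1` should show the extracted text under the `attachment` field.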

About FSCrawler, there's a tutorial.

HTH


I'm trying to download FSCrawler from the download page and getting a 404 Not Found:

https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler-es7/

I downloaded the zip file and configured it.

I see the error below while starting up FSCrawler. Any suggestions?

```
00:33:01,568 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [1.9gb/29.9gb=6.35%], RAM [262.2gb/314.5gb=83.38%], Swap [49.9gb/49.9gb=100.0%].
00:33:01,808 WARN [f.p.e.c.f.c.v.ElasticsearchClientV7] failed to create elasticsearch client, disabling crawler...
00:33:01,808 FATAL [f.p.e.c.f.c.FsCrawlerCli] We can not start Elasticsearch Client. Exiting.
java.lang.IllegalArgumentException: HTTP Host may not be null
at org.apache.http.util.Args.containsNoBlanks(Args.java:81) ~[httpcore-4.4.13.jar:4.4.13]
at org.apache.http.HttpHost.create(HttpHost.java:108) ~[httpcore-4.4.13.jar:4.4.13]
at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.lambda$buildRestClient$1(ElasticsearchClientV7.java:385) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.buildRestClient(ElasticsearchClientV7.java:385) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.start(ElasticsearchClientV7.java:141) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:257) [fscrawler-cli-2.7-SNAPSHOT.jar:?]
00:33:01,817 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [dba_docs] stopped
00:33:01,818 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [dba_docs] stopped
```

Below is the `_settings.yaml`:

```
[elk@usncx441 dba_docs]$ cat _settings.yaml
---
name: "dba_docs"
fs:
  url: "/elk/fscrawler_home/attachments"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://localhost:9200"
  - user: "elastic"
  - password: "*****"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
```

You need to download the SNAPSHOT version for the time being from https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7-SNAPSHOT/. Sorry for the confusion.

Please format your code, logs, or configuration files using the </> icon as explained in this guide, not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

The node settings are incorrect: `user` and `password` are indented as extra entries in the `nodes` list, so each is parsed as a node without a `url`, which is what causes the "HTTP Host may not be null" error. They belong one level up, under `elasticsearch`:

```
elasticsearch:
  nodes:
  - url: "http://localhost:9200"
  user: "elastic"
  password: "*****"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
```
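
To rule out connectivity or credential problems before restarting FSCrawler, a quick check against the cluster root can help. A minimal sketch, assuming the same localhost node and `elastic` user:

```
import requests  # third-party: pip install requests

# Hit the cluster root with the same URL and credentials that
# FSCrawler will use; a 200 with cluster info means both are valid.
resp = requests.get(
    "http://localhost:9200",
    auth=("elastic", "<your-password>"),  # placeholder password
)
print(resp.status_code)
print(resp.json())
```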

Hello

Yes, I was able to track it down and fix it.

It is now working.

Thanks

Hello,

I tried to index multiple documents from a single location.

However, only two documents out of more than 20 files were indexed.

I checked and found that those 2 docs were recently modified.

The remaining docs are older than one year.

I then updated some of those, re-indexed, and they were picked up.

Is this expected?

That sounds expected: by default FSCrawler only picks up files created or modified since the previous run. But you should provide more details.

You may want to start with the `--debug` option and share the logs.

Using the `--restart` option as well will make it scan all documents again.

I will run it again with the restart option and confirm the output.

Meanwhile, could you please let me know if it is possible to add a link to a document's source location via FSCrawler and pass it to Elasticsearch?

Below is an example scenario:

--> I index a PDF document into Elasticsearch.
--> The original PDF is available on SharePoint or some other external location.
--> I would like to have a link to that source.

Is it possible to do that?

You should look at Workplace Search, which is built for exactly that.

Anyway, maybe you could use this: https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#additional-tags
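
As a sketch (the endpoint and form fields follow that page; `source_url` is just a field name you would choose yourself), you could upload a file through FSCrawler's REST service, started with the `--rest` option, and pass the SharePoint link as an additional tag:

```
import json
import requests  # third-party: pip install requests

# Default FSCrawler REST endpoint when started with --rest.
UPLOAD_URL = "http://127.0.0.1:8080/fscrawler/_upload"

# Extra JSON merged into the indexed document; "source_url" is a
# made-up field pointing at the original SharePoint copy.
tags = {"external": {"source_url": "https://sharepoint.example.com/docs/report.pdf"}}

with open("report.pdf", "rb") as f:  # hypothetical local file
    resp = requests.post(
        UPLOAD_URL,
        files={
            "file": ("report.pdf", f),
            "tags": ("tags.json", json.dumps(tags)),
        },
    )
resp.raise_for_status()
print(resp.json())
```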

Thank you, I will take a look.

As of now, Workplace Search seems to be a paid product.

I had read that a free version is soon to be released.

Any idea on the timeline for that?

I have no timeframe for this, I'm afraid.
