Indexing Word and PDF documents?

Can someone please point me to step-by-step documentation on indexing a Word or PDF document in Elasticsearch?

I have gone through a couple of posts on this and came across FSCrawler, among other tools.

I would like to know if there is official documentation on this topic.

You can use the ingest attachment plugin.

There's an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

```
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id
```

The `data` field is simply the BASE64 representation of your binary file.
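For example, here is a minimal Python sketch of preparing that request body (the commented-out file path is hypothetical, and actually sending the `PUT` with `curl` or a client is left out):

```python
import base64
import json

# Sample binary content: the same tiny RTF document as in the example above.
raw = b"{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }"

# For a real Word/PDF file, read it in binary mode instead (hypothetical path):
# with open("my_document.pdf", "rb") as f:
#     raw = f.read()

# BASE64-encode the bytes; this becomes the "data" field.
encoded = base64.b64encode(raw).decode("ascii")

# Body for: PUT my_index/_doc/my_id?pipeline=attachment
body = json.dumps({"data": encoded})
print(body)
```

Sending that body through the `attachment` pipeline should give the same result as the example above.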

About FSCrawler, there's a tutorial.

HTH

I am trying to download FSCrawler from the download page and getting a 404 Not Found:

https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler-es7/

I downloaded the zip file and configured it.

I see the error below when starting FSCrawler. Any suggestions?

```
00:33:01,568 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [1.9gb/29.9gb=6.35%], RAM [262.2gb/314.5gb=83.38%], Swap [49.9gb/49.9gb=100.0%].
00:33:01,808 WARN [f.p.e.c.f.c.v.ElasticsearchClientV7] failed to create elasticsearch client, disabling crawler...
00:33:01,808 FATAL [f.p.e.c.f.c.FsCrawlerCli] We can not start Elasticsearch Client. Exiting.
java.lang.IllegalArgumentException: HTTP Host may not be null
at org.apache.http.util.Args.containsNoBlanks(Args.java:81) ~[httpcore-4.4.13.jar:4.4.13]
at org.apache.http.HttpHost.create(HttpHost.java:108) ~[httpcore-4.4.13.jar:4.4.13]
at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.lambda$buildRestClient$1(ElasticsearchClientV7.java:385) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.buildRestClient(ElasticsearchClientV7.java:385) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.start(ElasticsearchClientV7.java:141) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:257) [fscrawler-cli-2.7-SNAPSHOT.jar:?]
00:33:01,817 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [dba_docs] stopped
00:33:01,818 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [dba_docs] stopped
```

Below is the `_settings.yaml`:

```
[elk@usncx441 dba_docs]$ cat _settings.yaml
---
name: "dba_docs"
fs:
  url: "/elk/fscrawler_home/attachments"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://localhost:9200"
  - user: "elastic"
  - password: "*****"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
```

You need to download the SNAPSHOT version for the time being from https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7-SNAPSHOT/. Sorry for the confusion.

Please format your code, logs, or configuration files using the </> icon as explained in this guide, not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

The node settings are incorrect. They should be:

```
elasticsearch:
  nodes:
  - url: "http://localhost:9200"
  user: "elastic"
  password: "*****"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
```

Hello

Yes, I was able to track it down and fix it.

It is now working.

Thanks

Hello,

I tried to index multiple documents from a single location.

However, only two documents out of more than 20 files were indexed.

I checked and found that those two docs were recently modified.

The remaining docs are more than a year old.

I then updated some of the older ones, re-indexed, and they were picked up.

Is this expected?

You should provide more details.

Maybe start with the `--debug` option and share the logs.

Using the `--restart` option as well will force all documents to be scanned again.

I will do the restart again and confirm the output.

Meanwhile, could you please let me know if it is possible to add a link to the source location of a document via FSCrawler and pass it to Elasticsearch?

Below is an example scenario:

--> I index a PDF document into Elasticsearch.
--> The original PDF is available on SharePoint or some other external location.
--> I would like to have a link to that source.

Is it possible to do that?

You should look at Workplace Search, which is built for all of that.

Anyway, maybe you could use this: https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#additional-tags
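A quick sketch of what that could look like, assuming (per the linked page) that a tags file uploaded alongside the document is merged into the indexed document; the `external` object, its `url` field, and the SharePoint URL are all illustrative names, not a fixed schema:

```python
import json

# Extra fields to merge into the indexed document (illustrative names;
# the SharePoint URL is a made-up example of a source-location link).
tags = {
    "external": {
        "url": "https://sharepoint.example.com/sites/dba/docs/report.pdf"
    }
}

# Write the tags file to send along with the binary in the upload request
# to FSCrawler's REST endpoint (see the linked docs for the exact call).
with open("tags.json", "w") as f:
    json.dump(tags, f, indent=2)
```

The search result would then carry that `external.url` value, which your application can render as a link back to the original document.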

Thank you... Will take a look.

As of now, Workplace Search seems to be a paid product.

I had read that a free version is to be released soon.

Any idea on the timeline for that?

I have no timeframe for this, I'm afraid.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.