1. Use fscrawler to crawl my directory with `fs.store_source` enabled (which adds the Base64-encoded document as a field on the message), and send that message to a pipeline on ES that has the ingest-attachment plugin. If it's a static set of files, I may not even need fscrawler: just some simple code to crawl the directory and encode the files before sending them to ES. (Rough sketches of both variants below.)
2. Use fscrawler's own content extraction instead... but I'm not sure exactly what that involves; I find it hard to follow from the docs. Does fscrawler somehow replace the ingest-attachment plugin? The docs seem to suggest it doesn't need that plugin at all.
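For reference, here's roughly what I mean by option 1. This is a minimal sketch: the job name, paths, index and pipeline names are placeholders, and I'm not certain which field name fscrawler uses for the stored Base64 source (the docs mention `attachment`), so the processor's `field` would need to match whatever your fscrawler version actually writes.

```yaml
# _settings.yaml for the FSCrawler job (names and paths are placeholders)
name: "my_docs"
fs:
  url: "/path/to/files"
  store_source: true          # attach the Base64-encoded file to each document
elasticsearch:
  nodes:
    - url: "http://127.0.0.1:9200"
  pipeline: "attachment_pipeline"   # run this ingest pipeline on every document
```

```json
PUT _ingest/pipeline/attachment_pipeline
{
  "description": "Extract text from the Base64-encoded source",
  "processors": [
    {
      "attachment": {
        "field": "attachment"
      }
    }
  ]
}
```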
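And here's the "simple code" alternative I had in mind for a static set of files, sketched with the Python Elasticsearch client (8.x `index()` signature). The index and pipeline names are the same placeholders, and for this variant the pipeline's attachment processor would need its `field` set to `data`:

```python
import base64
import os

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

ROOT = "/path/to/files"            # directory to crawl (placeholder)
INDEX = "my_docs"                  # target index (placeholder)
PIPELINE = "attachment_pipeline"   # ingest pipeline using ingest-attachment

for dirpath, _dirs, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        # Read the binary file and Base64-encode it for the pipeline
        with open(path, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("ascii")
        # The pipeline's attachment processor reads from the "data" field here
        es.index(
            index=INDEX,
            pipeline=PIPELINE,
            document={"filename": name, "path": path, "data": encoded},
        )
```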
Thank you David!
I am wondering, then, which approach would be better for my use case: I have thousands of binary files to index, many gigabytes in total. My current fscrawler settings file uses store_source and sends each file to an ES pipeline that uses the ingest-attachment plugin. I don't use the REST server.
Would it be better to have fscrawler extract the content from the binary docs itself, and then send just that content to a bare ES index (no ingest-attachment pipeline)? And if so, should that go through the REST server, or some other method? (A sketch of what I think that would look like is below.)
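To make the question concrete, my understanding is that this second approach is just a settings file with no `store_source` and no `pipeline` entry at all, so fscrawler does the Tika extraction itself and indexes the extracted text (into the `content` field, if I'm reading the docs right) into a plain index:

```yaml
# _settings.yaml sketch: fscrawler extracts content itself, no ingest pipeline
name: "my_docs"
fs:
  url: "/path/to/files"
elasticsearch:
  nodes:
    - url: "http://127.0.0.1:9200"
```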