Recommended workflow for indexing many binary docs

I have two options in front of me:

  1. Use fscrawler to crawl my directory, use the fs setting store_source (Base64-encoded document) to add the file as a field on the message, and send that message to an ES pipeline that has the ingest-attachment plugin. If it's a static set of files, I may not even need fscrawler, just some simple code to crawl the directory and encode the files before sending them to ES (a rough sketch of what I mean is below this list).
  2. Use fscrawler on its own to do... something, but I'm not sure exactly what; it's hard for me to understand from the docs. Does fscrawler somehow replace the ingest-attachment plugin? The docs seem to suggest that it doesn't need the ingest-attachment plugin.
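For reference, here is roughly the "simple code" I have in mind for option 1: walk the directory, Base64-encode each file, and PUT it through an ingest pipeline. This is an untested sketch; the index name, pipeline name, and the "data" field name are placeholders and would have to match whatever the pipeline expects.

import base64
import pathlib

import requests  # plain HTTP for the sketch; any ES client would work the same way

ES_URL = "http://localhost:9200"   # placeholder
INDEX = "docs"                     # placeholder index
PIPELINE = "office-docs"           # placeholder pipeline running ingest-attachment

for path in pathlib.Path("/projects/stock").rglob("*"):
    if not path.is_file():
        continue
    # Base64-encode the raw bytes so the pipeline's attachment processor can parse them
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    # Using the file name as _id only for the sketch; a real run would need unique ids
    resp = requests.put(
        f"{ES_URL}/{INDEX}/_doc/{path.name}?pipeline={PIPELINE}",
        json={"data": encoded, "path": str(path)},
        timeout=120,
    )
    resp.raise_for_status()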

ben

You can use the ingest attachment plugin.

There's an example here: Using the Attachment Processor in a Pipeline | Elasticsearch Plugins and Integrations [7.13] | Elastic

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is basically the Base64 representation of your binary file.
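If you want to build that field from a real file, something along these lines would produce it (just a sketch; the file name is an example):

import base64

# Base64-encode any binary document for the "data" field used by the pipeline above
with open("example.pdf", "rb") as f:
    data_b64 = base64.b64encode(f.read()).decode("ascii")

# data_b64 is the string to send as {"data": ...} to my_index/_doc/my_id?pipeline=attachment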

You can use FSCrawler. There's a tutorial to help you get started.

Indeed, FSCrawler does not need an ingest pipeline to work, as it does the extraction itself instead of in Elasticsearch.
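With FSCrawler you just point a job at the directory and run it; there is no pipeline setting involved. If I recall the CLI correctly, running a job a single time looks roughly like this (the job name is an example):

bin/fscrawler test --loop 1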


Thank you David!
I am wondering, then, which would be the better approach for my use case, where I have thousands of binary files to index, many gigabytes in total. My current fscrawler settings file uses store_source and sends each file to an ES pipeline. The ES pipeline uses the ingest-attachment plugin. I don't use the REST server.

Would it be better to use fscrawler to extract the content from the binary docs, then send that content to a bare (no ingest-attachment plugin) ES index? And should that use the REST server, or some other method?

Here's my current settings file:

name: "job"
fs:
  url: "/projects/stock"
  continue_on_error: true
  index_content: "false"
  indexed_chars: "5000"
  ignore_above: "50mb"
  store_source: true
elasticsearch:
  nodes:
  - url: "http://localhost:9200"
  pipeline: "office-docs"
  index: "docs"

The current ES pipeline looks like this:

PUT _ingest/pipeline/office-docs
{
    "description": "Extract attachment information",
    "processors": [
      {
        "attachment": {
            "field": "attachment",
            "indexed_chars": -1
        }
      },
      {
        "remove": {
          "field": "attachment"
        }
      }
    ]
}

The fscrawler-only method would use a settings file like this:

name: "test"
fs:
  url: "/projects/stock"
  excludes:
  - "*/~*"
  continue_on_error: true
  index_folders: false
elasticsearch:
  nodes:
  - url: "http://localhost:9200"
  index: "test"

Is this more performant? Will it handle many gigabytes of data better than the ingest-attachment pipeline method?

Ben

I'd not do that; I would not use the store_source option.

Yes, it would be better to let FSCrawler extract the content and send it to a bare index. That removes a lot of memory pressure from the Elasticsearch nodes.

No, the REST server is not needed.
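Concretely, your second settings file is close to what I would run. Something along these lines, with no store_source and no elasticsearch.pipeline (values sketched from your current files):

name: "job"
fs:
  url: "/projects/stock"
  excludes:
  - "*/~*"
  continue_on_error: true
  ignore_above: "50mb"
  # index_content defaults to true, so FSCrawler extracts the text itself
elasticsearch:
  nodes:
  - url: "http://localhost:9200"
  index: "docs"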

