Is there a way to feed a base64-encoded string to the ingest_attachment plugin or fscrawler?

So I've been trying to come up with a PoC for document content search.
What I've found so far is that I can feed the documents to fscrawler and search their content with something like:

GET bookstore/_search
{
  "query": {
    "match_phrase": {
      "content": "coursera"
    }
  }
}

But my documents are actually base64-encoded strings, and I didn't find a way to feed those into fscrawler. Is that possible? The ingest attachment plugin seems like another thing I might look into (if I don't need image OCR and the other cool fscrawler features later), but I couldn't figure out how to use it either. Of course, I looked at the docs and tried doing something like this:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

PUT my-index-000001/_doc/bookstore?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

PUT my-index-000001/_doc/bookstore?pipeline=attachment
{
  "data": "aGV5IHRoaXMgaXMgY29vbA=="
}


GET my-index-000001/_doc/bookstore

But this doesn't work. I don't know much about elasticsearch; I was just doing a PoC to check whether it can be done or not. What's happening in the requests above is that the second PUT replaces the initial document. How can I feed multiple base64-encoded documents to the ingest attachment plugin and search their content?

Try removing bookstore from the URLs. I wonder if, with the removal of document types, it is being interpreted as the document ID, which causes an overwrite.

That doesn't work either, unfortunately. Any other guesses? Why is it so hard to find a working example of searching through multiple documents with the ingest attachment preprocessor plugin? I mean, isn't that one of the most common use cases of this plugin? :frowning:

Try this:

PUT my-index-000001/_doc/1?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

PUT my-index-000001/_doc/2?pipeline=attachment
{
  "data": "aGV5IHRoaXMgaXMgY29vbA=="
}

GET my-index-000001/_doc/1

GET my-index-000001/_doc/2

I think you didn't get what I wanted to do. What I want is to search across multiple documents. The queries you gave just add a couple of documents and retrieve them again. Or, if I misunderstood, can you please explain?

What you did was index and then update a single document with the document ID set to bookstore. My example showed how to index 2 different documents and then retrieve them separately, so you can inspect them and the result of the ingest pipeline. Once you have ingested your documents and they no longer overwrite each other, you can look at the indexed documents and start writing queries.
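
For instance, once both documents are indexed, a search across all of them could look like this (a sketch assuming the default setup, where the processor puts the extracted text into the attachment.content field):

GET my-index-000001/_search
{
  "query": {
    "match": {
      "attachment.content": "cool"
    }
  }
}

Your second sample decodes to "hey this is cool", so this should return document 2 only.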

I don't understand everything. Could you clarify a bit?

But my documents are actually base64-encoded strings

Do you mean that you have files on disk which contain BASE64 text content? Or do you mean something else?

I don't understand what you are trying to do actually.

The documents are obtained from an API as base64 strings and sent to Amazon S3. I actually learned how to use fscrawler from your video on YouTube. But after a lot of searching, it seems there's no functionality for indexing something that's in S3 as of now. So I thought I should index the documents in Elasticsearch before sending them to S3 (while they're in base64 format). Basically, I want lots of letters (obtained from an API as base64 strings, but always PDFs/docs) to be indexed by fscrawler or the ingest attachment plugin, and to be able to search for keywords in them.

True. That's something I'd like to have at some point, after I do a big refactoring of the project for version 3.x.

If I understand the use case correctly, you want to store files on S3 while being able to search for them.
As you have very standard files, like pdfs and docs, I'd probably use the ingest attachment plugin.

In my code, I'd:

  • Store the BASE64 content in S3
  • Get the URL of the S3 bucket (URL)
  • Send the BASE64 content to elasticsearch like this (POST without an explicit document ID lets elasticsearch generate a unique _id per document, so nothing gets overwritten):
POST my-index-000001/_doc?pipeline=attachment
{
  "url": "URL",
  "data": "BASE64"
}

The pipeline I'd use for this is:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }, 
    {
      "remove" : {
        "field" : "data"
      }
    }
  ]
}
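
You can also dry-run the pipeline with the simulate API before indexing anything, to check what the attachment processor extracts (a sketch reusing the sample BASE64 from earlier):

POST _ingest/pipeline/attachment/_simulate
{
  "docs": [
    {
      "_source": {
        "url": "URL",
        "data": "aGV5IHRoaXMgaXMgY29vbA=="
      }
    }
  ]
}

The response shows each document as it would be indexed, with the extracted text under attachment.content and the data field removed.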

HTH

Oh, I've actually tried doing this, but I didn't understand how to actually search after creating the pipeline. https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment-with-arrays.html Is this what I want to use? That approach basically creates everything inside an attachments field, and when searching for something it either gives me both attachments or none at all. Thanks for the help. :slight_smile:
Also, my documents are available to the application before they are sent to S3. Why get the URL and send it to elasticsearch together with the content, instead of doing those steps separately?

You need to provide an example of an indexed document and a query which does not match.
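
For instance, the output of something like this (a sketch; match_all returns whatever is in the index, so we can compare the stored fields with your query):

GET my-index-000001/_search
{
  "query": {
    "match_all": {}
  }
}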

I thought you would like to give the user a link to the original document in the search response.
If you don't need that, then remove the url field I added.
