Ingest-attachment ingest local docs


(Alina Frey) #1

I am using the ingest-attachment plugin for Elasticsearch to parse documents, with the goal of searching for a word and retrieving the documents that contain it.

I created a pipeline:

PUT: http://localhost:9200/_ingest/pipeline/pipeline-for-many-attachements
{
  "description" : "Extract attachment information from arrays",
  "processors" : [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.attachment",
            "field": "_ingest._value.data"
          }
        }
      }
    }
  ]
}

I added attachments to an index and processed them using the pipeline above:

PUT: http://localhost:9200/index-for-many-attachments/doc/0?pipeline=pipeline-for-many-attachements
{
  "attachments" : [
    {
      "filename" : "ipsum.txt",
      "data" : "QWxpbmEgaGFkIGx1bmNoLg=="
    },
    {
      "filename" : "test.txt",
      "data" : "Sm9zaCB3YXMgb24gdmFjYXRpb24u"
    }
  ]
}

Result:
{
    "_index": "index-for-many-attachments",
    "_type": "doc",
    "_id": "0",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "created": true
}

The content of "data" is base64-encoded text.
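
To make the encoding step concrete, here is a minimal Python sketch that decodes the two "data" values from the request above and encodes a string the same way. It uses only the standard library; the helper name encode_text is hypothetical and not part of Elasticsearch.

```python
import base64


def encode_text(text: str) -> str:
    """Return the base64 form of a UTF-8 string (hypothetical helper)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# The two "data" values from the request above decode back to plain text.
first = base64.b64decode("QWxpbmEgaGFkIGx1bmNoLg==").decode("utf-8")
second = base64.b64decode("Sm9zaCB3YXMgb24gdmFjYXRpb24u").decode("utf-8")
print(first)
print(second)
```

Calling encode_text on a string goes the other way and produces a value that can be placed in "data".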

Get the document:

GET: http://localhost:9200/index-for-many-attachments/doc/0

Result:
{
    "_index": "index-for-many-attachments",
    "_type": "doc",
    "_id": "0",
    "_version": 2,
    "found": true,
    "_source": {
        "attachments": [
            {
                "filename": "ipsum.txt",
                "data": "QWxpbmEgaGFkIGx1bmNoLg==",
                "attachment": {
                    "content_type": "text/plain; charset=ISO-8859-1",
                    "language": "sk",
                    "content": "Alina had lunch.",
                    "content_length": 17
                }
            },
            {
                "filename": "test.txt",
                "data": "Sm9zaCB3YXMgb24gdmFjYXRpb24u",
                "attachment": {
                    "content_type": "text/plain; charset=ISO-8859-1",
                    "language": "en",
                    "content": "Josh was on vacation.",
                    "content_length": 22
                }
            }
        ]
    }
}
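
For the word search itself, a query body along these lines might work. This is only a sketch: it assumes the default dynamic mapping (so "attachments" is not mapped as nested) and takes the field path from the _source shown above. build_word_search is a hypothetical helper name.

```python
import json


def build_word_search(word: str) -> dict:
    """Build a match query against the extracted attachment text."""
    return {
        "query": {
            "match": {
                # Field path taken from the _source shown above.
                "attachments.attachment.content": word
            }
        }
    }


# Body for: POST http://localhost:9200/index-for-many-attachments/_search
print(json.dumps(build_word_search("vacation"), indent=2))
```

Note that a hit here is the whole document (id 0 above), not the individual attachment that contained the word.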

My intention now is to pass real documents from my local machine to ingest-attachment and be able to search for words in them.

My question: how do I tell my call to pick up documents stored locally?
For example, doc1.pdf and doc2.pdf stored at path1/doc1.pdf and path2/doc2.pdf.
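
For context, the attachment processor does not read from the file system; it only decodes the base64 value found in the configured field. One option is therefore to have the client read each file and encode it before sending. A minimal Python sketch, assuming the request shape used above; build_attachments_body is a hypothetical helper, and the resulting dict still has to be sent as the JSON body with an HTTP client of your choice.

```python
import base64
from pathlib import Path


def build_attachments_body(paths):
    """Read local files and return a body shaped like the request above."""
    attachments = []
    for path in map(Path, paths):
        raw = path.read_bytes()  # binary read, so PDFs work as well as text
        attachments.append({
            "filename": path.name,
            "data": base64.b64encode(raw).decode("ascii"),
        })
    return {"attachments": attachments}


# Usage (paths from the question; adjust to where the files really are):
#   body = build_attachments_body(["path1/doc1.pdf", "path2/doc2.pdf"])
#   then send body as JSON with ?pipeline=pipeline-for-many-attachements
```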


(David Pilato) #2

In case it helps, have a look at the FSCrawler project, which I think does that.


(Alina Frey) #3

Is FSCrawler going to be used instead of ingest-attachment, or together with it?


(David Pilato) #4

Instead.

