Ingest attachment plugin not able to index complex PDF

I am trying to index the following PDF file, but it fails:
https://drive.google.com/open?id=1KS_ow1mfQKLjv_9zihpkn-No1vq0Z_a7


// Create the ingest pipeline with the attachment processor
axios.put(`${ELASTIC_SEARCH_DOC_URL}/_ingest/pipeline/${id}`, {
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": -1,
        "ignore_failure": true
      }
    }
  ]
}, {
  headers: {
    "Content-Type": "application/json"
  }
}).then(() => {
  // Index the base64-encoded file through the pipeline
  const url = `${ELASTIC_SEARCH_DOC_URL}/${ELASTIC_SEARCH_DOC_INDEX}/documents/${id}?pipeline=${id}`
  const data = {
    data: base64Encode(file),
    link: f.link,
    tranId: f.tranId,
    ...f.indexingTags,
  }

  axios.put(url, data, {
    headers: {
      "Content-Type": "application/json"
    }
  }).then((iData) => {
    indexedFiles.push(f)
    resolve(iData)
  }).catch(err => {
    fs.appendFileSync(errorFileLogs, `Attachment Pipeline Failed => ${url}\n${err}\n\n`)
    resolve()
  })
}).catch(err => {
  console.log(err)
  fs.appendFileSync(errorFileLogs, `Ingest Pipeline Failed => \n${err}\n`)
  resolve()
})

I was able to successfully index a simple PDF: http://www.africau.edu/images/default/sample.pdf

Can someone guide me on how to fix this issue?

Thanks

What is the error? What are the logs?

It is throwing a 400 error on the HTTP request; the exact error message is:
Error: Request failed with status code 400

You probably have more details than that. Look at the logs.
Otherwise, try to reproduce it with the Kibana console or curl.
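
For example, something along these lines should surface the full error body (a sketch: localhost:9200, the pipeline name attachment, the index docs, and the document id 1 are placeholders, and JVBERi0x... stands in for your base64-encoded PDF):

# Create the pipeline (same definition as in the script above)
curl -X PUT "localhost:9200/_ingest/pipeline/attachment" -H 'Content-Type: application/json' -d'
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": -1,
        "ignore_failure": true
      }
    }
  ]
}'

# Index a document through it; "data" holds the base64-encoded PDF (truncated here)
curl -X PUT "localhost:9200/docs/documents/1?pipeline=attachment" -H 'Content-Type: application/json' -d'
{
  "data": "JVBERi0x..."
}'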

Thanks @dadoonet, I just found the following error:


{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Document contains at least one immense term in field=\"data.raw\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[74, 86, 66, 69, 82, 105, 48, 120, 76, 106, 81, 75, 74, 89, 67, 65, 103, 73, 65, 75, 77, 83, 65, 119, 73, 71, 57, 105, 97, 103]...', original message: bytes can be at most 32766 in length; got 183560"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "Document contains at least one immense term in field=\"data.raw\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[74, 86, 66, 69, 82, 105, 48, 120, 76, 106, 81, 75, 74, 89, 67, 65, 103, 73, 65, 75, 77, 83, 65, 119, 73, 71, 57, 105, 97, 103]...', original message: bytes can be at most 32766 in length; got 183560",
        "caused_by": {
            "type": "max_bytes_length_exceeded_exception",
            "reason": "bytes can be at most 32766 in length; got 183560"
        }
    },
    "status": 400
}

@dadoonet the issue is fixed. I was also indexing the base64-encoded data field itself, which was quite large: it was being indexed as a single term in data.raw, and a term cannot exceed 32766 bytes (mine was 183560).

Thanks for the support
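
For anyone landing here later: the attachment processor leaves the original base64 field on the document, so it gets indexed like any other string. A minimal sketch of one fix (assuming the same pipeline and field names as above) is to drop the field with a remove processor once the text has been extracted:

{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": -1,
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "field": "data"
      }
    }
  ]
}

Alternatively, keep the field but map it with "index": false (or set ignore_above on the keyword sub-field) so it is never turned into a searchable term.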

