Ingest attachment plugin not able to index complex PDF


(Lakhan Samani) #1

I am trying to index the following PDF file, but it fails.
https://drive.google.com/open?id=1KS_ow1mfQKLjv_9zihpkn-No1vq0Z_a7


axios.put(`${ELASTIC_SEARCH_DOC_URL}/_ingest/pipeline/${id}`, {
        "description" : "Extract attachment information",
        "processors" : [
          {
            "attachment" : {
              "field" : "data",
              "indexed_chars": -1,
              "ignore_failure": true,
            }
          }
        ]
      }, {
        headers: {
          "Content-Type": "application/json"
        }
      }).then(() => {

        const url = `${ELASTIC_SEARCH_DOC_URL}/${ELASTIC_SEARCH_DOC_INDEX}/documents/${id}?pipeline=${id}`
        const data = {
          data: base64Encode(file),
          link: f.link,
          tranId: f.tranId,
          ...f.indexingTags,
        }

        axios.put(url, data, {
          headers: {
            "Content-Type": "application/json"
          }
        }).then((iData) => {
          indexedFiles.push(f)
          resolve(iData)
        }).catch(err => {
          fs.appendFileSync(errorFileLogs, `Attachment Pipeline Failed => ${url}\n${err}\n\n`)
          resolve()
        })
      }).catch(err => {
        console.log(err)
        fs.appendFileSync(errorFileLogs, `Ingest Pipeline Failed => \n${err}\n`)
        resolve()
      })

I was able to successfully index a simple pdf: http://www.africau.edu/images/default/sample.pdf

Can someone guide me on how to fix the above issue?

Thanks


(David Pilato) #2

What is the error? What are the logs?


(Lakhan Samani) #3

It is throwing a 400 error on the HTTP request. The exact error message is:
Error: Request failed with status code 400


(David Pilato) #4

You probably have other details. Look at the logs.
Otherwise try to reproduce with Kibana console or curl.
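
For readers hitting the same wall from Node.js: the one-line message "Request failed with status code 400" is only the error's summary. A minimal sketch, assuming the axios error shape in which `err.response` (when the server replied) carries `status` and the parsed body in `data`; the helper name `describeHttpError` is hypothetical and not part of the original code:

```javascript
// Hypothetical helper: turn an axios-style error into a readable log entry.
// Assumes err.response is set when the server replied, with the HTTP status
// in err.response.status and the response body in err.response.data.
function describeHttpError(err) {
  if (err && err.response) {
    const { status, data } = err.response;
    // Elasticsearch replies with JSON, which axios normally parses to an object.
    const body = typeof data === "string" ? data : JSON.stringify(data, null, 2);
    return `HTTP ${status}\n${body}`;
  }
  // No response at all: network failure, timeout, or a non-HTTP error.
  return String(err && err.message ? err.message : err);
}
```

Logging `describeHttpError(err)` in the `catch` handlers instead of `err` would likely have written the Elasticsearch error body to the log file directly.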


(Lakhan Samani) #5

Thanks @dadoonet, I just found the following error:


{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Document contains at least one immense term in field=\"data.raw\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[74, 86, 66, 69, 82, 105, 48, 120, 76, 106, 81, 75, 74, 89, 67, 65, 103, 73, 65, 75, 77, 83, 65, 119, 73, 71, 57, 105, 97, 103]...', original message: bytes can be at most 32766 in length; got 183560"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "Document contains at least one immense term in field=\"data.raw\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[74, 86, 66, 69, 82, 105, 48, 120, 76, 106, 81, 75, 74, 89, 67, 65, 103, 73, 65, 75, 77, 83, 65, 119, 73, 71, 57, 105, 97, 103]...', original message: bytes can be at most 32766 in length; got 183560",
        "caused_by": {
            "type": "max_bytes_length_exceeded_exception",
            "reason": "bytes can be at most 32766 in length; got 183560"
        }
    },
    "status": 400
}

(Lakhan Samani) #6

@dadoonet the issue is fixed. I was also indexing the base64-encoded field itself (the `data.raw` field in the error above), which was quite large.

Thanks for the support
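
The thread does not show the exact change that was made. One common way to avoid indexing the base64 source is to drop it inside the pipeline once the attachment processor has extracted the text. A minimal sketch, assuming the source field is named `data` as in the original request; the function name `buildAttachmentPipeline` is hypothetical:

```javascript
// Hypothetical builder for the ingest pipeline body.
// Assumption: the base64 payload lives in `sourceField` ("data" in this thread).
function buildAttachmentPipeline(sourceField = "data") {
  return {
    description: "Extract attachment information",
    processors: [
      // Extract text and metadata from the base64 payload (same options as the original post).
      { attachment: { field: sourceField, indexed_chars: -1, ignore_failure: true } },
      // Remove the base64 source so it is neither stored nor indexed.
      { remove: { field: sourceField } },
    ],
  };
}
```

The returned object can be passed as the body of the same `PUT ${ELASTIC_SEARCH_DOC_URL}/_ingest/pipeline/${id}` request used above. If the base64 content has to stay in the document, an alternative is to change the index mapping so that `data` is not indexed as a keyword term.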

