Ingest attachment plugin not able to index complex PDF

I am trying to index the following PDF file, but it fails:
https://drive.google.com/open?id=1KS_ow1mfQKLjv_9zihpkn-No1vq0Z_a7


// Create the ingest pipeline with the attachment processor
axios.put(`${ELASTIC_SEARCH_DOC_URL}/_ingest/pipeline/${id}`, {
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": -1,
        "ignore_failure": true
      }
    }
  ]
}, {
  headers: {
    "Content-Type": "application/json"
  }
}).then(() => {
  // Index the base64-encoded file through the pipeline
  const url = `${ELASTIC_SEARCH_DOC_URL}/${ELASTIC_SEARCH_DOC_INDEX}/documents/${id}?pipeline=${id}`
  const data = {
    data: base64Encode(file),
    link: f.link,
    tranId: f.tranId,
    ...f.indexingTags,
  }

  axios.put(url, data, {
    headers: {
      "Content-Type": "application/json"
    }
  }).then((iData) => {
    indexedFiles.push(f)
    resolve(iData)
  }).catch(err => {
    fs.appendFileSync(errorFileLogs, `Attachment Pipeline Failed => ${url}\n${err}\n\n`)
    resolve()
  })
}).catch(err => {
  console.log(err)
  fs.appendFileSync(errorFileLogs, `Ingest Pipeline Failed => \n${err}\n`)
  resolve()
})

I was able to successfully index a simple PDF: http://www.africau.edu/images/default/sample.pdf

Can someone guide me on how to fix this issue?

Thanks

What is the error? What are the logs?

It is throwing a 400 error on the HTTP request; the exact error message is:
Error: Request failed with status code 400

You probably have more details than that. Look at the logs.
Otherwise, try to reproduce it with the Kibana console or curl.
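
For example, something along these lines should surface the full error body (a sketch: localhost:9200, the pipeline name attachment, the index docs, and the document id 1 are placeholders, and JVBERi0x... stands in for your base64-encoded PDF):

# Create the pipeline (same definition as in the script above)
curl -X PUT "localhost:9200/_ingest/pipeline/attachment" -H 'Content-Type: application/json' -d'
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": -1,
        "ignore_failure": true
      }
    }
  ]
}'

# Index a document through it; "data" holds the base64-encoded PDF (truncated here)
curl -X PUT "localhost:9200/docs/documents/1?pipeline=attachment" -H 'Content-Type: application/json' -d'
{
  "data": "JVBERi0x..."
}'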

Thanks @dadoonet, I just found the following error:


{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Document contains at least one immense term in field=\"data.raw\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[74, 86, 66, 69, 82, 105, 48, 120, 76, 106, 81, 75, 74, 89, 67, 65, 103, 73, 65, 75, 77, 83, 65, 119, 73, 71, 57, 105, 97, 103]...', original message: bytes can be at most 32766 in length; got 183560"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "Document contains at least one immense term in field=\"data.raw\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[74, 86, 66, 69, 82, 105, 48, 120, 76, 106, 81, 75, 74, 89, 67, 65, 103, 73, 65, 75, 77, 83, 65, 119, 73, 71, 57, 105, 97, 103]...', original message: bytes can be at most 32766 in length; got 183560",
        "caused_by": {
            "type": "max_bytes_length_exceeded_exception",
            "reason": "bytes can be at most 32766 in length; got 183560"
        }
    },
    "status": 400
}

@dadoonet the issue is fixed. I was also indexing the base64-encoded data field itself, which was quite large: it was being indexed as a single term in data.raw, and a term cannot exceed 32766 bytes (mine was 183560).

Thanks for the support
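
For anyone landing here later: the attachment processor leaves the original base64 field on the document, so it gets indexed like any other string. A minimal sketch of one fix (assuming the same pipeline and field names as above) is to drop the field with a remove processor once the text has been extracted:

{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": -1,
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "field": "data"
      }
    }
  ]
}

Alternatively, keep the field but map it with "index": false (or set ignore_above on the keyword sub-field) so it is never turned into a searchable term.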

