Bulk operation failing to index first document without attachment field

russinholi · August 12, 2020, 7:08pm

Hi all, I'm using Elasticsearch 7.8 and I'm running into a weird situation when using the bulk operation to index documents that may or may not have an attachment. For example, if I run the following command:

curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{"index":{"_id":"1","_index":"t1"}}
{"active_user":false,"content":"Something I wrote", "document_id":"1","topic":"Test1"}
{"index":{"_id":"2","_index":"t1"}}
{"active_user":false,"content":"Something I wrote", "document_id":"2","topic":"Test2"}
{"index":{"_id":"3","_index":"t1","pipeline":"attachment"}}
{"data":"<BASE64ENCODEDPDFFILE>", "document_id":"3","topic":"Test PDF"}
'

I get this error in the ES log:

{"type": "server", "timestamp": "2020-08-12T19:00:50,149Z", "level": "DEBUG", "component": "o.e.a.b.T.BulkRequestModifier", "cluster.name": "docker-cluster", "node.name": "4e71dc30d5e2", "message": "failed to execute pipeline [_none] for document [t1/_doc/1]", "cluster.uuid": "3YRzz0W_RvGuSiJrIGC3GQ", "node.id": "qlGXQX8uRXum5ZGd9rdPcQ" , 
"stacktrace": ["org.elasticsearch.ingest.IngestProcessorException: ElasticsearchParseException[Error parsing document in field [data]]; nested: 
TikaException[TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@565cd0c5]; nested: IOException[Page tree root must be a dictionary];"

Except for the first document, the other two are indexed successfully. It looks as if Elasticsearch expects a "data" field on the first document, but if I remove the first document, the second one raises the same error.

And if I index the first document on its own in a separate bulk operation, it is indexed normally.

So it seems to be a problem with mixing attachments and other documents in the same bulk request.

Can someone help me to understand what I'm doing wrong?

dadoonet · August 12, 2020, 7:56pm

IMO this should be reported as a bug. We should not send empty content to Tika.

That being said, you can add an on_failure parameter to your pipeline to catch the exception and index the document as is. See https://www.elastic.co/guide/en/elasticsearch/reference/7.8/handling-failure-in-pipelines.html
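
For illustration, here is a minimal sketch of what such a pipeline might look like. The thread does not show the actual "attachment" pipeline definition, so the body below is an assumption: it reuses the pipeline name and the "data" field from the bulk request above, while the attachment_error field, the attachment-pipeline.json file name, and the ES_URL variable are hypothetical names chosen for this example. Whether the handler actually fires for the failure logged above (which is attributed to pipeline [_none]) is not confirmed here.

```shell
# Write a hypothetical pipeline definition to a file. The attachment
# processor reads "data"; if extraction fails, the on_failure handler
# stores the error message on the document instead of rejecting it.
cat > attachment-pipeline.json <<'EOF'
{
  "description": "Extract attachment text; keep the document if extraction fails",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "on_failure": [
          {
            "set": {
              "field": "attachment_error",
              "value": "{{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    }
  ]
}
EOF

# Assumed node address; override ES_URL if your cluster lives elsewhere.
ES_URL="${ES_URL:-localhost:9200}"

# Only register the pipeline when a node answers at that address.
if curl -s -o /dev/null --max-time 5 "$ES_URL"; then
  curl -X PUT "$ES_URL/_ingest/pipeline/attachment?pretty" \
    -H 'Content-Type: application/json' \
    --data-binary @attachment-pipeline.json
fi
```

With a handler like this, a document whose attachment cannot be parsed would be indexed with an attachment_error field rather than failing its bulk item, which makes the affected documents easy to find afterwards.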

Christian_Dahlqvist · August 12, 2020, 7:57pm

As far as I can tell, this is an existing bug.

dadoonet · August 12, 2020, 8:01pm

Ha, right. My answer is wrong.
