Getting error while parsing documents

Hi Everyone,

I am using the ingest-attachment plugin to index documents. Text documents (.txt files) parse fine, but when I try to parse .doc or .pdf files I get this error:

FILE = /elastic/files/englishAnalyzer.doc
ID = 6

"error" : {
"root_cause" : [
{
"type" : "exception",
"reason" : "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];",
"header" : {
"processor_type" : "attachment"
}
}
],
"type" : "exception",
"reason" : "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];",
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];",
"caused_by" : {
"type" : "parse_exception",
"reason" : "Error parsing document in field [data]",
"caused_by" : {
"type" : "tika_exception",
"reason" : "Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079",
"caused_by" : {
"type" : "array_index_out_of_bounds_exception",
"reason" : "-1"
}
}
}
},
"header" : {
"processor_type" : "attachment"
}
},
"status" : 500
}

Please help me resolve this issue.

Please find my template and pipeline configuration below.

Template

curl -XPUT 'localhost:9200/_template/template_1?pretty' -H 'Content-Type: application/json' -d'
{
"order": 0,
"template": "policies*",
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"refresh_interval": "1s"
},
"mappings": {
"policy": {
"_all": {
"enabled": false
},
"properties": {
"@timestamp": {
"include_in_all": false,
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"filename": {
"type": "keyword",
"ignore_above": 256
},
"isEnabled": {
"type": "boolean"
},
"data": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"attachment" : {
"properties" : {
"content_length" : { "type": "long" },
"author" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"date" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"language" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"title" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"keywords" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"content_type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "english",
"term_vector": "with_positions_offsets"
}
}
}
},
"dynamic_templates": [
{
"strings": {
"match_mapping_type": "*",
"mapping": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
]
}
}
}
'

Pipeline

curl -XPUT 'localhost:9200/_ingest/pipeline/attachment' -d'
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data",
"target_field" : "attachment",
"indexed_chars" : -1,
"ignore_missing" : true
}
}
]
}'
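
One way to narrow a problem like this down is to run a document through the pipeline with the ingest simulate API, which executes the processors without indexing anything. The following is only a sketch: it assumes the same `localhost:9200` node and the pipeline name `attachment` used above, and the function names `simulate_body` and `simulate_attachment` and the sample payload are illustrative, not part of the original post.

```shell
#!/bin/bash
# Build the request body for the ingest simulate API.
# $1 is an already base64-encoded payload for the "data" field.
simulate_body() {
  printf '{"docs":[{"_source":{"data":"%s"}}]}' "$1"
}

# Send the body to the "attachment" pipeline's _simulate endpoint.
# Host and port are assumed to match the curl commands above.
simulate_attachment() {
  simulate_body "$1" |
    curl -s -XPOST 'localhost:9200/_ingest/pipeline/attachment/_simulate?pretty' \
      -H 'Content-Type: application/json' --data-binary @-
}

# "SGVsbG8gd29ybGQ=" is the base64 encoding of the text "Hello world".
simulate_body 'SGVsbG8gd29ybGQ='
echo
# Uncomment to call the cluster:
# simulate_attachment 'SGVsbG8gd29ybGQ='
```

If the simulated document comes back with an `attachment.content` field for a known-good payload, the pipeline itself is probably fine and the input data is the next thing to check.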

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

Can you share your file somewhere? Another option is to raise an issue on the Tika project with this file attached.

And BTW, if you have the full stack trace in the Elasticsearch logs, that could help as well.

Here is the same error, template, and pipeline configuration as in my first post, this time formatted as code.

Don't repeat yourself. Just editing the initial post would have been fine.

And if you can provide what I asked for that would help.

Sorry for repeating my query. I am new here and to Elasticsearch.

I couldn't find a stack trace in the Elasticsearch logs. Can you please briefly explain what a stack trace is? It would be helpful if you could provide the path.

Share the logs.
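
For reference, a stack trace is the block of `at ...` lines that follows an exception in the log. Each `Caused by:` line introduces the exception underneath the previous one, and the last of them is the root cause. Below is a small sketch for pulling that chain out of a log file. The log location is an assumption (it depends on how Elasticsearch was installed and on the `path.logs` setting), and the function names are illustrative.

```shell
#!/bin/bash
# Print the exception chain from an Elasticsearch log file:
# every "Caused by:" line, in order, with the root cause last.
caused_by_chain() {
  grep '^Caused by:' "$1"
}

# Print only the root cause (the last "Caused by:" line).
root_cause() {
  caused_by_chain "$1" | tail -n 1
}

# Assumed default for package installs; archive installs typically use
# $ES_HOME/logs/<cluster_name>.log instead. Adjust to your setup.
LOG_FILE="${LOG_FILE:-/var/log/elasticsearch/elasticsearch.log}"

if [ -r "$LOG_FILE" ]; then
  root_cause "$LOG_FILE"
else
  echo "log file not readable: $LOG_FILE" >&2
fi
```

Running it with `LOG_FILE=/path/to/your.log` set in the environment points it at a different file.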

Can you please let me know how to share the logs? I don't see any attachment/upload option.
I am pasting the most recent error log here.

[2017-04-27T17:57:14,035][WARN ][r.suppressed             ] path: /policies/policy/6, params: {pipeline=attachment, pretty=1, refresh=true,index=policies, id=6, type=policy}
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];
        at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:166) ~[elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:41) ~[elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.ingest.PipelineExecutionService$1.doRun(PipelineExecutionService.java:65) [elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) [elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.2.1.jar:5.2.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
Caused by: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];
        ... 11 more
Caused by: org.elasticsearch.ElasticsearchParseException: Error parsing document in field [data]
        at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:145) ~[?:?]
        at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.2.1.jar:5.2.1]
        ... 9 more
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) ~[?:?]
        at org.apache.tika.Tika.parseToString(Tika.java:568) ~[?:?]
        at org.elasticsearch.ingest.attachment.TikaImpl$1.run(TikaImpl.java:94) ~[?:?]
        at org.elasticsearch.ingest.attachment.TikaImpl$1.run(TikaImpl.java:91) ~[?:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_111]
        at org.elasticsearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:91) ~[?:?]
        at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:86) ~[?:?]
        at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.2.1.jar:5.2.1]
        ... 9 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.poi.poifs.filesystem.BlockStore$ChainLoopDetector.claim(BlockStore.java:99) ~[?:?]
        at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:168) ~[?:?]
        at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:142) ~[?:?]
        at org.apache.poi.poifs.property.NPropertyTable.buildProperties(NPropertyTable.java:87) ~[?:?]
        at org.apache.poi.poifs.property.NPropertyTable.<init>(NPropertyTable.java:66) ~[?:?]
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:440) ~[?:?]
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:235) ~[?:?]
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:168) ~[?:?]
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:120) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) ~[?:?]
        at org.apache.tika.Tika.parseToString(Tika.java:568) ~[?:?]
        at org.elasticsearch.ingest.attachment.TikaImpl$1.run(TikaImpl.java:94) ~[?:?]
        at org.elasticsearch.ingest.attachment.TikaImpl$1.run(TikaImpl.java:91) ~[?:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_111]
        at org.elasticsearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:91) ~[?:?]
        at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:86) ~[?:?]
        at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.2.1.jar:5.2.1]
        ... 9 more

Thanks. That's what I was looking for.

If you can share the binary file that produces this error, that'd be lovely.

Dear David,

It is failing for all files except ".txt" files.

For example, take a normal Word file and copy the text below into it:

Fox jumped out of window
Sensex jump by 200 points
Temperature jumps in Summers

Regards
Venu

We have integration tests that run against a bunch of files, and they are passing.

So please share what you have. Thanks.

Dear David,

Thank you for your reply.

In that case, I must be doing something wrong. I am sharing my code. Can you please check it?

#!/bin/bash
# Use > 1 to consume two arguments per pass in the loop (e.g. each
# argument has a corresponding value to go with it).
# Use > 0 to consume one or more arguments per pass in the loop (e.g.
# some arguments don't have a corresponding value to go with it such
# as in the --default example).
# note: if this is set to > 0 the /etc/hosts part is not recognized ( may be a bug )
while [[ $# -gt 1 ]]
do
key="$1"

case $key in
    -f|--file)
    FILE="$2"
    shift # past argument
    ;;
    -i|--id)
    ID="$2"
    shift # past argument
    ;;
    *)
          # unknown option
    ;;
esac
shift # past argument or value
done

echo FILE = "${FILE}"
echo ID = "${ID}"
fileName=`basename "${FILE}"`

coded=`cat "${FILE}" | tr -d '\n' | perl -MMIME::Base64 -ne 'print encode_base64($_)'`

json="{\"isEnabled\": true, \"filename\": \"${fileName}\", \"data\": \"${coded}\" }"
echo "$json" > json.file
curl -XPUT "http://localhost:9200/policies/policy/${ID}?pipeline=attachment&refresh=true&pretty=1" -d @json.file

Calling the above code:

./indexFile.sh -i 1 -f /elastic/files/englishAnalyzer.docx
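
One thing that may be worth checking in the script above (this is an assumption, not something confirmed in this thread): `tr -d '\n'` runs before the base64 step, so it removes every 0x0A byte from the raw file. A plain-text file survives that apart from losing its line breaks, but a binary .doc or .pdf would no longer be the same file, which would be consistent with a parser failing on everything except .txt. A sketch of encoding first and stripping only the line wraps afterwards, with a round-trip check; it assumes the `base64` utility from GNU coreutils, and the function names are illustrative.

```shell
#!/bin/bash
# Encode a file as a single base64 line: encode the raw bytes first,
# then strip only the line wraps that base64 itself adds.
encode_file() {
  base64 < "$1" | tr -d '\n'
}

# Return success if decoding the encoded text reproduces the file byte for byte.
roundtrip_ok() {
  encode_file "$1" | base64 -d | cmp -s - "$1"
}

# Demo on a small binary sample that contains newline (0x0A) bytes.
sample=$(mktemp)
printf '\320\317\021\340\n\000\001\n\377' > "$sample"

if roundtrip_ok "$sample"; then
  echo "round trip OK"
else
  echo "round trip FAILED"
fi
```

In the original script, this idea would amount to replacing the `coded=` line with something like `coded=$(encode_file "${FILE}")`. Decoding the `data` value from `json.file` and comparing it with the source file is a quick way to see whether the payload was altered before it reached Elasticsearch.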

You can see the pipeline and template in my previous posts.
I am using:
Elasticsearch 5.2.1
ingest-attachment-5.2.1
I am attaching a screenshot of the jars inside the ingest-attachment plugin.

Please let me know how to share a .doc/.pdf file.

Regards
Venu

It looks good to me. If you need help, please share one of your files that does not work so I can try to reproduce the problem.
