Getting error while parsing documents


(VENU AMBATI) #1

Hi Everyone,

I am using the ingest-attachment plugin to index documents. I can parse text documents (.txt files), but when I try to parse .doc or .pdf files I get this error:

FILE = /elastic/files/englishAnalyzer.doc
ID = 6

"error" : {
"root_cause" : [
{
"type" : "exception",
"reason" : "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];",
"header" : {
"processor_type" : "attachment"
}
}
],
"type" : "exception",
"reason" : "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];",
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];",
"caused_by" : {
"type" : "parse_exception",
"reason" : "Error parsing document in field [data]",
"caused_by" : {
"type" : "tika_exception",
"reason" : "Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079",
"caused_by" : {
"type" : "array_index_out_of_bounds_exception",
"reason" : "-1"
}
}
}
},
"header" : {
"processor_type" : "attachment"
}
},
"status" : 500
}

Please help me resolve this issue.

Please find below my template and pipeline configuration.

Template

curl -XPUT 'localhost:9200/_template/template_1?pretty' -H 'Content-Type: application/json' -d'
{
"order": 0,
"template": "policies*",
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"refresh_interval": "1s"
},
"mappings": {
"policy": {
"_all": {
"enabled": false
},
"properties": {
"@timestamp": {
"include_in_all": false,
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"filename": {
"type": "keyword",
"ignore_above": 256
},
"isEnabled": {
"type": "boolean"
},
"data": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"attachment" : {
"properties" : {
"content_length" : { "type": "long" },
"author" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"date" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"language" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"title" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"keywords" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"content_type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "english",
"term_vector": "with_positions_offsets"
}
}
}
},
"dynamic_templates": [
{
"strings": {
"match_mapping_type": "*",
"mapping": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
]
}
}
}
'

Pipeline

curl -XPUT 'localhost:9200/_ingest/pipeline/attachment' -d'
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data",
"target_field" : "attachment",
"indexed_chars" : -1,
"ignore_missing" : true
}
}
]
}'
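
A pipeline like this can be exercised on its own with the ingest simulate API, which runs the processors against a supplied document without indexing anything. The sketch below only builds the request body. `simulate_body` is a hypothetical helper name, and the node address and file names in the usage comment are assumptions taken from the commands above.

```shell
#!/bin/sh
# Sketch: build a request body for the ingest simulate API.
# simulate_body is a hypothetical helper, not part of Elasticsearch.
# $1 is expected to be the already base64-encoded file content.
simulate_body() {
    printf '{"docs":[{"_source":{"data":"%s"}}]}\n' "$1"
}

# Usage, assuming a node on localhost:9200 and a pipeline named "attachment":
#   simulate_body "$coded" > simulate.json
#   curl -XPOST 'localhost:9200/_ingest/pipeline/attachment/_simulate?pretty' \
#        -H 'Content-Type: application/json' -d @simulate.json
```

If the simulate call reports the same `tika_exception`, the problem is likely in the payload or the file rather than in the index template.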


(David Pilato) #2

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

Can you share your file somewhere? Another thing you can do is raise an issue on the Tika project with this file attached.


(David Pilato) #3

And BTW, if you have the full stack trace in the Elasticsearch logs, that could help as well.


(VENU AMBATI) #4

Here is the error I am getting:

    "error" : {
"root_cause" : [
{
"type" : "exception",
"reason" : "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];",
"header" : {
"processor_type" : "attachment"
}
}
],
"type" : "exception",
"reason" : "java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];",
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];",
"caused_by" : {
"type" : "parse_exception",
"reason" : "Error parsing document in field [data]",
"caused_by" : {
"type" : "tika_exception",
"reason" : "Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079",
"caused_by" : {
"type" : "array_index_out_of_bounds_exception",
"reason" : "-1"
}
}
}
},
"header" : {
"processor_type" : "attachment"
}
},
"status" : 500
}

Please help me resolve this issue.

Please find below my template and pipeline configuration.

Template

curl -XPUT 'localhost:9200/_template/template_1?pretty' -H 'Content-Type: application/json' -d'
{
     "order": 0,
     "template": "policies*",
     "settings": {
       "number_of_shards": 1,
       "number_of_replicas": 0,
       "refresh_interval": "1s"
     },
     "mappings": {
       "policy": {
         "_all": {
           "enabled": false
         },
         "properties": {
           "@timestamp": {
             "include_in_all": false,
             "type": "date",
             "format": "strict_date_optional_time||epoch_millis"
           },
           "filename": {
             "type": "keyword",
             "ignore_above": 256
           },
           "isEnabled": {
             "type": "boolean"
           },
           "data": {
             "type": "text",
             "fields": {
               "keyword": {
                 "type": "keyword",
                 "ignore_above": 256
               }
             }
           },
           "attachment" : {
             "properties" : {
               "content_length" : { "type": "long" },
               "author" : {
                 "type": "text",
                 "fields": {
                   "keyword": {
                     "type": "keyword"
                   }
                 }
               },
               "date" : {
                 "type": "text",
                 "fields": {
                   "keyword": {
                     "type": "keyword",
                     "ignore_above": 256
                   }
                 }
               },
               "language" : {
                 "type": "text",
                 "fields": {
                   "keyword": {
                     "type": "keyword",
                     "ignore_above": 256
                   }
                 }
               },
               "name" : {
                 "type": "text",
                 "fields": {
                   "keyword": {
                     "type": "keyword",
                     "ignore_above": 256
                   }
                 }
               },
               "title" : {
                 "type": "text",
                 "fields": {
                   "keyword": {
                     "type": "keyword",
                     "ignore_above": 256
                   }
                 }
               },
               "keywords" : {
                 "type": "text",
                 "fields": {
                   "keyword": {
                     "type": "keyword",
                     "ignore_above": 256					 
                   }
                 }
               },
               "content_type": {
                 "type": "text",
                 "fields": {
                   "keyword": {
                     "type": "keyword",
                     "ignore_above": 256
                   }
                 }
               },
               "content": {
                 "type": "text",
                 "fields": {
                   "keyword": {
                     "type": "keyword",
                     "ignore_above": 256
                   }
                 },
                 "analyzer": "english",
                 "term_vector": "with_positions_offsets"
               }
            }
          }
         },
         "dynamic_templates": [
           {
             "strings": {
               "match_mapping_type": "*",
               "mapping": {
                 "type": "text",
                 "fields": {
                   "keyword": {
                     "type": "keyword"
                   }
                 }
               }
             }
           }
         ]
       }
     }
   }
'

Pipeline

curl -XPUT 'localhost:9200/_ingest/pipeline/attachment' -d'
   {
     "description" : "Extract attachment information",
     "processors" : [
       {
         "attachment" : {
           "field" : "data",
           "target_field" : "attachment",
           "indexed_chars" : -1,
           "ignore_missing" : true
         }
       }
     ]
   }'

(David Pilato) #5

Don't repeat yourself. Just editing the initial post would have been fine.

And if you can provide what I asked for that would help.


(VENU AMBATI) #6

Sorry for repeating my query. I am new here and new to Elasticsearch.

I couldn't find a stack trace in the Elasticsearch logs. Can you please briefly explain what a stack trace is? It would be helpful if you could tell me where to find it.


(David Pilato) #7

Share the logs.


(VENU AMBATI) #8

Can you please let me know how to share the logs? I don't see any attachment or upload option.
I am pasting the most recent error log here.

[2017-04-27T17:57:14,035][WARN ][r.suppressed             ] path: /policies/policy/6, params: {pipeline=attachment, pretty=1, refresh=true,index=policies, id=6, type=policy}
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];
        at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:166) ~[elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:41) ~[elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.ingest.PipelineExecutionService$1.doRun(PipelineExecutionService.java:65) [elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) [elasticsearch-5.2.1.jar:5.2.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.2.1.jar:5.2.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
Caused by: java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [data]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079]; nested: ArrayIndexOutOfBoundsException[-1];
        ... 11 more
Caused by: org.elasticsearch.ElasticsearchParseException: Error parsing document in field [data]
        at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:145) ~[?:?]
        at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.2.1.jar:5.2.1]
        ... 9 more
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@28992079
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) ~[?:?]
        at org.apache.tika.Tika.parseToString(Tika.java:568) ~[?:?]
        at org.elasticsearch.ingest.attachment.TikaImpl$1.run(TikaImpl.java:94) ~[?:?]
        at org.elasticsearch.ingest.attachment.TikaImpl$1.run(TikaImpl.java:91) ~[?:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_111]
        at org.elasticsearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:91) ~[?:?]
        at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:86) ~[?:?]
        at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.2.1.jar:5.2.1]
        ... 9 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.poi.poifs.filesystem.BlockStore$ChainLoopDetector.claim(BlockStore.java:99) ~[?:?]
        at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:168) ~[?:?]
        at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:142) ~[?:?]
        at org.apache.poi.poifs.property.NPropertyTable.buildProperties(NPropertyTable.java:87) ~[?:?]
        at org.apache.poi.poifs.property.NPropertyTable.<init>(NPropertyTable.java:66) ~[?:?]
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:440) ~[?:?]
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:235) ~[?:?]
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:168) ~[?:?]
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:120) ~[?:?]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[?:?]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) ~[?:?]
        at org.apache.tika.Tika.parseToString(Tika.java:568) ~[?:?]
        at org.elasticsearch.ingest.attachment.TikaImpl$1.run(TikaImpl.java:94) ~[?:?]
        at org.elasticsearch.ingest.attachment.TikaImpl$1.run(TikaImpl.java:91) ~[?:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_111]
        at org.elasticsearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:91) ~[?:?]
        at org.elasticsearch.ingest.attachment.AttachmentProcessor.execute(AttachmentProcessor.java:86) ~[?:?]
        at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-5.2.1.jar:5.2.1]
        ... 9 more

(David Pilato) #9

Thanks. That's what I was looking for.

If you can share the binary file which produces that error that'd be lovely.


(VENU AMBATI) #10

Dear David,

It is failing for all files except ".txt" files.

For example, can you take a normal Word file and copy the text below into it?

Fox jumped out of window
Sensex jump by 200 points
Temperature jumps in Summers

Regards
Venu


(David Pilato) #11

We have integration tests which run against a bunch of files, and they are passing.

So please share what you have. Thanks.


(VENU AMBATI) #12

Dear David,

Thank you for your reply.

In that case, I must be doing something wrong. I am sharing my code. Can you please check it?

#!/bin/bash
# [[ ... ]] is a bash construct, so the script needs bash rather than plain sh.
# Use -gt 1 to consume two arguments per pass in the loop (e.g. each
# argument has a corresponding value to go with it).
# Use -gt 0 to consume one or more arguments per pass in the loop (e.g.
# some arguments don't have a corresponding value to go with it such
# as in the --default example).
# note: if this is set to -gt 0 the /etc/hosts part is not recognized ( may be a bug )
# -gt compares numerically; > inside [[ ]] compares strings.
while [[ $# -gt 1 ]]
do
key="$1"

case $key in
    -f|--file)
    FILE="$2"
    shift # past argument
    ;;
    -i|--id)
    ID="$2"
    shift # past argument
    ;;
    *)
          # unknown option
    ;;
esac
shift # past argument or value
done

echo FILE = "${FILE}"
echo ID = "${ID}"
fileName=`basename "${FILE}"`

coded=`cat "${FILE}" | tr -d '\n' | perl -MMIME::Base64 -ne 'print encode_base64($_)'`

json="{\"isEnabled\": true, \"filename\": \"${fileName}\", \"data\": \"${coded}\" }"
echo "$json" > json.file
curl -XPUT "http://localhost:9200/policies/policy/${ID}?pipeline=attachment&refresh=true&pretty=1" -d @json.file

Calling the above code:

./indexFile.sh -i 1 -f /elastic/files/englishAnalyzer.docx
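
One detail in the script above may be worth checking: `tr -d '\n'` runs on the file contents before they are base64-encoded, so it deletes every 0x0A byte from the input. In a plain text file that only removes line breaks, but in a .doc or .pdf file those bytes are part of the binary structure. This would be consistent with .txt files working while binary formats fail inside the parser, although that cause is not confirmed anywhere in this thread. A minimal sketch of an encoding step that leaves the input bytes untouched and strips only the line wrapping from the base64 output follows. `encode_file` and `roundtrip_ok` are hypothetical helper names, and a `base64` command that accepts `--decode` (GNU coreutils or macOS) is assumed.

```shell
#!/bin/sh
# encode_file is a hypothetical helper used for illustration.
# It base64-encodes the file's bytes unchanged and removes only the
# line breaks that base64 inserts into its own output.
encode_file() {
    base64 < "$1" | tr -d '\n'
}

# Optional sanity check: decoding should reproduce the original file.
# Returns 0 when the round trip is byte-identical.
roundtrip_ok() {
    encode_file "$1" | base64 --decode | cmp -s - "$1"
}

# Usage inside the script above (variable names taken from it):
#   coded=$(encode_file "${FILE}")
#   roundtrip_ok "${FILE}" || echo "encoding does not round-trip" >&2
```

Running `roundtrip_ok` against the original pipeline's output for a binary file would show whether the payload sent to Elasticsearch still matches the file on disk.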

You can see the pipeline and template in my previous posts.
I am using:
Elasticsearch 5.2.1
ingest-attachment-5.2.1
I am attaching a screenshot of the jars inside the ingest-attachment plugin.

Please let me know how to share a .doc or .pdf file.

Regards
Venu


(David Pilato) #13

It looks good to me. If you need help, please share one of your files that does not work so I can try to reproduce the problem.

