With FSCrawler 2.7 I am not able to index PDF and other document types that worked fine with 2.6.

The FSCrawler log says this:

[Elasticsearch exception [type=mapper_parsing_exception, reason=failed to parse]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Malformed content, found extra data after parsing: FIELD_NAME]];
16:10:16,190 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] Error caught for [ips-internal-doc-index]/[_doc]/[8f531bfbb22847e4c87c31a17a6284]: ElasticsearchException[Elasticsearch exception [type=mapper_parsing_exception, reason=failed to parse]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Malformed content, found extra data after parsing: FIELD_NAME]];
16:10:16,191 WARN  [f.p.e.c.f.c.v.ElasticsearchClientV7] Got [3] failures of [4] requests

The Elasticsearch log says this:

"Caused by: java.lang.IllegalArgumentException: Malformed content, found extra data after parsing: FIELD_NAME",
"at org.elasticsearch.index.mapper.DocumentParser.validateEnd(DocumentParser.java:146) ~[elasticsearch-7.3.0.jar:7.3.0]",
"at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:72) ~[elasticsearch-7.3.0.jar:7.3.0]",
"... 34 more"] }

The same configuration worked fine for me with the same documents in FSCrawler 2.6. I am using the default Elasticsearch configuration throughout.
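In case it helps with the comparison: since the error is a mapper_parsing_exception, the mapping of the index FSCrawler writes to can be dumped with a standard Elasticsearch call and compared against the default mapping FSCrawler 2.7 ships (which, if I remember correctly, lives under ~/.fscrawler/_default/). A minimal sketch, assuming the cluster is local on port 9200 and using the index name from the log above:

# dump the current mapping of the target index for comparison with FSCrawler's defaults
curl -s 'http://localhost:9200/ips-internal-doc-index/_mapping?pretty'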

Could you share a way to reproduce it (your FSCrawler settings and a PDF file that causes this)?

Here is the job configuration file.
{
  "name" : "job-internal-doc-index",
  "fs" : {
    "url" : "/path/to/docs",
    "update_rate" : "30m",
    "includes" : [ ".pdf", ".xls", ".xlsx", ".ppt", ".doc", ".docx" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : false,
    "add_as_inner_object" : false,
    "store_source" : true,
    "index_content" : true,
    "indexed_chars" : "-1",
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : true,
    "continue_on_error" : false
  },
  "elasticsearch" : {
    "nodes" : [ {
      "url" : "ES_URL"
    } ],
    "index" : "index1",
    "index_folder" : "index_folder1",
    "pipeline" : "ips",
    "bulk_size" : 1000,
    "flush_interval" : "5s",
    "byte_size" : "10mb"
  }
}
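For completeness, the job above also references an ingest pipeline named ips and a separate folder index. Both can be sanity-checked with standard Elasticsearch calls; a minimal sketch, assuming the cluster is reachable at localhost:9200 (replace with your actual ES_URL):

# confirm the ingest pipeline referenced by the job exists
curl -s 'http://localhost:9200/_ingest/pipeline/ips?pretty'
# confirm the folder index exists and inspect its mapping
curl -s 'http://localhost:9200/index_folder1/_mapping?pretty'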

How do I attach the sample PDF file here? The forum is not letting me upload a PDF.

Could you share it somewhere and paste the link here?

See if you can access the sample document. In fact, almost all documents are failing with the same error, even though FSCrawler can parse them, extract the metadata, and create the JSON (see the sketch below for how I checked that).
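A minimal sketch of how that can be verified, assuming the bundled bin/fscrawler launcher and the job name from my settings above; --trace turns on verbose logging and --loop 1 runs a single scan, both as described in the FSCrawler docs:

# run one scan with trace logging so the generated JSON documents show up in the log
bin/fscrawler job-internal-doc-index --trace --loop 1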

Any clue on how to debug this further? I'd appreciate your help, as we need to resolve it ASAP.

@dadoonet It worked for me after I deleted the existing index and let it be recreated, as per the discussion in https://github.com/dadoonet/fscrawler/issues/755. I'll come back if I run into any more issues; for now, the issue seems resolved.
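For anyone else hitting this, a minimal sketch of the recovery steps, assuming a local cluster and the index names from my settings above. Note that deleting the indices removes the already-indexed data, so double-check the names first; --restart re-runs the job from scratch per the FSCrawler docs:

# drop the indices created by FSCrawler 2.6 so 2.7 can recreate them with its own mapping
curl -X DELETE 'http://localhost:9200/index1'
curl -X DELETE 'http://localhost:9200/index_folder1'
# restart the job from scratch; FSCrawler recreates the indices before indexing
bin/fscrawler job-internal-doc-index --restart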

Great. If you have any idea of what happened and how to reproduce the problem, please open an issue in the FSCrawler project. Thanks!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.