[Elasticsearch exception [type=mapper_parsing_exception, reason=failed to parse]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Malformed content, found extra data after parsing: FIELD_NAME]];
16:10:16,190 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] Error caught for [ips-internal-doc-index]/[_doc]/[8f531bfbb22847e4c87c31a17a6284]: ElasticsearchException[Elasticsearch exception [type=mapper_parsing_exception, reason=failed to parse]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Malformed content, found extra data after parsing: FIELD_NAME]];
16:10:16,191 WARN  [f.p.e.c.f.c.v.ElasticsearchClientV7] Got [3] failures of [4] requests
The Elasticsearch log shows:
"Caused by: java.lang.IllegalArgumentException: Malformed content, found extra data after parsing: FIELD_NAME",
"at org.elasticsearch.index.mapper.DocumentParser.validateEnd(DocumentParser.java:146) ~[elasticsearch-7.3.0.jar:7.3.0]",
"at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:72) ~[elasticsearch-7.3.0.jar:7.3.0]",
"... 34 more"] }
The same configuration worked fine for me with the same documents in FSCrawler 2.6, and I am using an all-default Elasticsearch configuration.
Please see if you can access the sample document. In fact, almost all documents are failing with the same error, even though FSCrawler can parse them, extract metadata, and create the JSON.
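In case it helps, here is one way to look at what the existing index actually contains versus what FSCrawler 2.7 expects (the index name is taken from the log above, a stale 2.6-era mapping is only my guess at the cause, and this assumes the default localhost:9200 endpoint):

  # Inspect the mapping of the index that FSCrawler 2.6 originally created
  curl -X GET "localhost:9200/ips-internal-doc-index/_mapping?pretty"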
@dadoonet It worked for me after I deleted the existing index and recreated it, as per the discussion in https://github.com/dadoonet/fscrawler/issues/755. I will come back if I hit any more issues; for now, the issue seems resolved.
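For anyone hitting the same error, a minimal sketch of that fix, assuming the index name from the log above, the default localhost:9200 endpoint, and a hypothetical FSCrawler job name:

  # Delete the index that was created by the older FSCrawler version
  curl -X DELETE "localhost:9200/ips-internal-doc-index"

  # Start the FSCrawler job again; it recreates the index with the current
  # mapping if the index is missing. The job name "my_job" is hypothetical.
  bin/fscrawler my_job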