How to avoid ��� chars when using FS Crawler?

Hello,

I'm trying to use FS Crawler 2.6 on a Windows Server machine to index a huge number of files in my company. It's a very large Windows folder tree on a network drive: 14.92 TB, 7.2M files in 2.3M folders. The data is on a remote filer in the same data center.

Many indexed documents contain ��� characters.

Why does this happen?
How can I avoid it?

My _settings.json file:

{
  "name" : "disc-files-prd",
  "fs" : {
    "url" : "\\\\mycompany\\mycompany\\myfirstfolder",
    "update_rate" : "120h",
    "indexed_chars" : 100000,
    "includes" : [
      "*/*.csv",
      "*/*.doc",
      "*/*.docx",
      "*/*.ods",
      "*/*.odp",
      "*/*.odt",
      "*/*.pdf",
      "*/*.pps",
      "*/*.ppsx",
      "*/*.ppt",
      "*/*.pptx",
      "*/*.rtf",
      "*/*.txt",
      "*/*.wps",
      "*/*.xls",
      "*/*.xlsx",
      "*/*.xlsm",
      "*/*.xps"
    ],
    "excludes" : [
      "*/~*",
      "*/*.tmp",
      "*/*.eml",
      "*/*.jpg",
      "*/*.png",
      "*/ISC/NP*",
      "*/ISC_INTL/NP*",
      "*/ISC_OSC/NP*",
      "*/_CONTINGENCIA/*",
      "*/_HISTORICO/*"
    ],
    "json_support" : false,
    "follow_symlink" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : false,
    "add_as_inner_object" : true,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : true,
    "ignore_above" : "20mb",
    "pdf_ocr" : false,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "url" : "https://elasticsearch.mycompany.com.br"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "4s",
    "byte_size" : "5mb",
    "path_prefix" : "disc",
    "username" : "xxxxx-all",
    "password" : "*******",
    "index" : "xxxxxx-docs",
    "index_folder" : "xxxxxxx-folders"
  },
  "rest" : {
    "url" : "http://127.0.0.1:8080/fscrawler"
  }
}

An example of an indexed document:

{
  "took": 113,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 718042,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "disc-files-docs",
        "_type": "_doc",
        "_id": "c4ca9c74b40f690fc7f1b4442a6a6a9",
        "_score": 1.0,
        "_source": {
          "content": "�� �ɘ����������� �ʰ�������{2474569B-416F-43D7-99A6-5F5AB8242EA0}��������������������������������������������������������������������������������������������@�����Ā�€���$GK$ ISC_INTL_MAPS_AI_NP-3

How can I avoid these ���� characters?

Thanks for your help.

I guess this is a file encoding issue.
Make sure you're using UTF-8 everywhere.
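
To illustrate what an encoding mismatch looks like, here is a minimal Python sketch. It is not FS Crawler code, and the sample word is made up. It only shows the general mechanism by which the replacement character U+FFFD (displayed as �) ends up in text when bytes written in one charset are decoded as another.

```python
# Sketch: how the U+FFFD replacement character appears after a charset mismatch.
# The sample word is hypothetical; this is not FS Crawler or Tika code.
original = "Contingência"

# Simulate a file that was written in a legacy single-byte charset.
latin1_bytes = original.encode("latin-1")

# Decoding those bytes as UTF-8 hits an invalid sequence at the 0xEA byte.
# errors="replace" substitutes U+FFFD instead of raising UnicodeDecodeError.
wrong = latin1_bytes.decode("utf-8", errors="replace")

# Decoding with the charset the bytes were actually written in round-trips cleanly.
right = latin1_bytes.decode("latin-1")

print(wrong)
print(right)
```

If the garbled text contains long runs of � rather than the occasional accented letter going missing, a simple charset mismatch is a less likely explanation.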

Do you have a sample document I can use to reproduce the problem?

Yes. How should I send it to you? By email?
It is a PDF file.

You can upload it to any file-sharing service and post the link here.

It is here.

Thank you very much

So I tried your file and it seems that PDFBox is logging some warnings:

00:05:23,158 WARN  [o.a.p.p.f.PDSimpleFont] No Unicode mapping for 0 (0) in font T1
00:05:23,173 WARN  [o.a.p.p.f.PDSimpleFont] No Unicode mapping for 1 (2) in font T1
00:05:23,174 WARN  [o.a.p.p.f.PDSimpleFont] No Unicode mapping for i255 (3) in font T1
00:05:23,174 WARN  [o.a.p.p.f.PDSimpleFont] No Unicode mapping for 3 (4) in font T1
00:05:23,175 WARN  [o.a.p.p.f.PDSimpleFont] No Unicode mapping for 4 (5) in font T1
00:05:23,176 WARN  [o.a.p.p.f.PDSimpleFont] No Unicode mapping for 5 (6) in font T1

Not sure if it's related though.
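
To check whether these warnings line up with the garbled documents, one option is to count them per font in the crawler log. The Python sketch below assumes the log lines have exactly the format quoted above; the function name and the third sample line are hypothetical.

```python
import re
from collections import Counter

# Matches warnings of the form quoted above:
# "No Unicode mapping for <glyph> (<code>) in font <font>"
WARNING_RE = re.compile(r"No Unicode mapping for (\S+) \((\d+)\) in font (\S+)")

def count_unmapped_glyphs(log_lines):
    """Return a Counter mapping font name to the number of matching warnings."""
    counts = Counter()
    for line in log_lines:
        match = WARNING_RE.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts

sample = [
    "00:05:23,158 WARN  [o.a.p.p.f.PDSimpleFont] No Unicode mapping for 0 (0) in font T1",
    "00:05:23,174 WARN  [o.a.p.p.f.PDSimpleFont] No Unicode mapping for i255 (3) in font T1",
    "00:05:23,180 INFO  some unrelated line",  # hypothetical line, ignored by the pattern
]
print(count_unmapped_glyphs(sample))
```

A high count for a single font in a single file would hint that the font lacks a usable mapping from glyph codes to Unicode, though that remains an assumption until confirmed upstream.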

I'd raise an issue with the PDFBox project or on the Tika mailing list to see whether this is a bug.
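
In the meantime, a rough way to gauge how many documents are affected is to measure the share of U+FFFD characters in each extracted `content` value. This is only a sketch: the function names and the 0.3 threshold are assumptions chosen for illustration, not anything FS Crawler provides.

```python
REPLACEMENT = "\ufffd"  # the character displayed as "�"

def replacement_ratio(text):
    """Fraction of characters in text that are U+FFFD (0.0 for empty text)."""
    if not text:
        return 0.0
    return text.count(REPLACEMENT) / len(text)

def looks_garbled(text, threshold=0.3):
    """Flag text whose share of U+FFFD reaches the (hypothetical) threshold."""
    return replacement_ratio(text) >= threshold

print(replacement_ratio("ab\ufffd\ufffd"))
print(looks_garbled("plain readable text"))
```

Running this over the `content` field of a sample of indexed documents would show whether the problem is limited to a few PDFs or is widespread.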

BTW I created a branch to test this: https://github.com/dadoonet/fscrawler/tree/test/pdf-encoding
