Hello,
I'm trying to use FSCrawler 2.6 on a Windows Server machine to index a huge number of files at my company. The source is a very large Windows folder tree on a network drive: 14.92 TB, 7.2M files in 2.3M folders. The data is on a remote filer in the same data center.
In many indexed documents, there are ��� chars.
Why does this happen?
How can I avoid it?
My _settings.json file:
{
  "name" : "disc-files-prd",
  "fs" : {
    "url" : "\\\\mycompany\\mycompany\\myfirstfolder",
    "update_rate" : "120h",
    "indexed_chars" : 100000,
    "includes": [
      "*/*.doc",
      "*/*.pdf",
      "*/*.csv",
      "*/*.doc",
      "*/*.docx",
      "*/*.ods",
      "*/*.odp",
      "*/*.odt",
      "*/*.pdf",
      "*/*.pps",
      "*/*.ppsx",
      "*/*.ppt",
      "*/*.pptx",
      "*/*.rtf",
      "*/*.txt",
      "*/*.wps",
      "*/*.xls",
      "*/*.xlsx",
      "*/*.xlsm",
      "*/*.xps"
    ],
    "excludes": [
      "*/~*",
      "*/*.tmp",
      "*/*.eml",
      "*/*.jpg",
      "*/*.png",
      "*/ISC/NP*",
      "*/ISC_INTL/NP*",
      "*/ISC_OSC/NP*",
      "*/_CONTINGENCIA/*",
      "*/_HISTORICO/*"
    ],
    "json_support" : false,
    "follow_symlink" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : false,
    "add_as_inner_object" : true,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : true,
    "ignore_above": "20mb",
    "pdf_ocr" : false,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "url" : "https://elasticsearch.mycompany.com.br"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "4s",
    "byte_size" : "5mb",
    "path_prefix" : "disc",
    "username" : "xxxxx-all",
    "password" : "*******",
    "index" : "xxxxxx-docs",
    "index_folder": "xxxxxxx-folders"
  },
  "rest" : {
    "url" : "http://127.0.0.1:8080/fscrawler"
  }
}
An example of an indexed document:
{
  "took": 113,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 718042,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "disc-files-docs",
        "_type": "_doc",
        "_id": "c4ca9c74b40f690fc7f1b4442a6a6a9",
        "_score": 1.0,
        "_source": {
"content": "�� �ɘ����������� �ʰ�������{2474569B-416F-43D7-99A6-5F5AB8242EA0}��������������������������������������������������������������������������������������������@�����Ā����$GK$ ISC_INTL_MAPS_AI_NP-3
How can I avoid these characters?
Thanks for your help.
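To gauge how widespread the problem is, a check along these lines could be run against the "content" field of each hit. This is only a minimal sketch: it assumes the odd characters are U+FFFD (the Unicode replacement character), and the function names and the 10% threshold are hypothetical choices of mine, not anything provided by FSCrawler or Elasticsearch.

```python
# Assumption: the garbled characters are U+FFFD, the Unicode replacement character.
REPLACEMENT_CHAR = "\ufffd"


def replacement_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are U+FFFD."""
    if not text:
        return 0.0
    return text.count(REPLACEMENT_CHAR) / len(text)


def looks_garbled(text: str, threshold: float = 0.1) -> bool:
    """Flag text whose share of U+FFFD characters reaches `threshold`.

    The default threshold of 0.1 is an arbitrary illustration value.
    """
    return replacement_ratio(text) >= threshold


if __name__ == "__main__":
    # Shortened, made-up stand-in for a "content" value like the one above.
    sample = "\ufffd\ufffd \ufffd$GK$ ISC_INTL_MAPS_AI_NP-3"
    print(replacement_ratio(sample), looks_garbled(sample))
```

Grouping the flagged documents by file extension might then show whether the problem is limited to particular file types.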