Hi @dadoonet,
I have run into an issue with zip files.
My config is pretty straightforward:
---
name: "test"
fs:
  url: "<LOCAL_PATH_OF_DOC_FOLDER>"
  update_rate: "15h"
  excludes:
  - "*/~*"
  - "*.zip/*.html"
  - "*.zip/*.js"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: false
    pdf_strategy: "no_ocr"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "<ELASTIC_URL>"
  api_key: "<API_KEY>"
  index: "test"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: true
  push_templates: true
I want to index all files in the doc folder. However, the folder contains an abc.zip file, and from that archive I only want to extract the PDF and text files; all other document types inside it should be excluded.
I was looking for config filters that can be applied to the contents of zip files. With the above config, I was not able to get the desired results.
I had no luck with the config below either.
---
name: "test"
fs:
  url: "<LOCAL_PATH_OF_DOC_FOLDER>"
  update_rate: "15h"
  excludes:
  - "*.html"
  - "*.js"
  - "*.bin"
  - "*.exe"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: false
    pdf_strategy: "no_ocr"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "<ELASTIC_URL>"
  api_key: "<API_KEY>"
  index: "test"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: true
  push_templates: true
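To make the first request concrete, here is a rough Python sketch of the filtering I am after. It is not FSCrawler code and says nothing about how FSCrawler handles archives internally; it only uses the standard `zipfile` module, and the names `WANTED_EXTENSIONS`, `wanted_entries` and `build_sample_zip` are hypothetical, made up for this illustration.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Hypothetical allow-list: the only entry types I want indexed from a zip.
WANTED_EXTENSIONS = {".pdf", ".txt"}


def wanted_entries(zip_source):
    """Return the names of zip entries whose extension is on the allow-list."""
    with zipfile.ZipFile(zip_source) as archive:
        return [
            info.filename
            for info in archive.infolist()
            if not info.is_dir()
            and PurePosixPath(info.filename).suffix.lower() in WANTED_EXTENSIONS
        ]


def build_sample_zip():
    """Build an in-memory zip with the same entry names as my abc.zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in ("abc.pdf", "abc.txt", "ab.js", "any.html"):
            archive.writestr(name, "placeholder")
    buffer.seek(0)
    return buffer


print(wanted_entries(build_sample_zip()))
```

For my abc.zip this would keep abc.pdf and abc.txt and skip ab.js and any.html, which is the behaviour I was hoping to get from the `excludes` setting.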
Also, regarding the content indexed into ES: the content of the zip document contains the file name of each document in the zip archive, followed by that document's actual content. Is there a way to get only the content, without the file names?
{
  "_index": "test",
  "_id": "4",
  "_score": 1,
  "_source": {
    "content": """
abc.pdf
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse suscipit libero
vel diam finibus, ac ullamcorper turpis faucibus. Aenean elementum nunc sed hendrerit
placerat. Aliquam finibus leo non tempus rhoncus. In sit amet tincidunt arcu. Nunc
laoreet, ex nec laoreet sodales, ante massa facilisis felis, vitae tincidunt libero
erat nec justo. Donec cursus, ex eget egestas aliquet, est ipsum vehicula quam, eget
eleifend purus odio quis nisi. Donec non gravida metus, eu malesuada nunc. Vestibulum
ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Sed
consequat rhoncus odio, id feugiat diam. Morbi feugiat nibh metus, id pharetra ipsum
volutpat consectetur. Nunc viverra molestie diam, non auctor massa malesuada auctor.
abc.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse suscipit libero vel diam finibus, ac ullamcorper turpis faucibus. Aenean elementum nunc sed hendrerit placerat. Aliquam finibus leo non tempus rhoncus. In sit amet tincidunt arcu. Nunc laoreet, ex nec laoreet sodales, ante massa facilisis felis, vitae tincidunt libero erat nec justo. Donec cursus, ex eget egestas aliquet, est ipsum vehicula quam, eget eleifend purus odio quis nisi. Donec non gravida metus, eu malesuada nunc. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Sed consequat rhoncus odio, id feugiat diam. Morbi feugiat nibh metus, id pharetra ipsum volutpat consectetur. Nunc viverra molestie diam, non auctor massa malesuada auctor.
ab.js
// the hello world program
console.log('Hello World');
any.html
My First Heading
My first paragraph.
""",
    "file": {
      "extension": "zip",
      "content_type": "application/zip",
      "filesize": 12858,
      "filename": "abc.zip",
      "url": "file:///doc/abc.zip"
    },
    "path": {
      "virtual": "/abc.zip",
      "real": "doc/abc.zip"
    }
  }
}
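For the second request, this is roughly the shape of output I am hoping for, again sketched in plain Python rather than anything FSCrawler provides. The function name `text_content_only` is hypothetical, and the sketch only handles plain-text entries that are UTF-8 encoded (a PDF would need a real parser).

```python
import io
import zipfile


def text_content_only(zip_source, extensions=(".txt",)):
    """Concatenate the text of matching entries, leaving out the entry names."""
    parts = []
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.filename.lower().endswith(extensions):
                # Assumes UTF-8 text; only the body is kept, not info.filename.
                parts.append(archive.read(info).decode("utf-8").strip())
    return "\n".join(parts)
```

In other words, the `content` field would hold only the document bodies, with no "abc.pdf" or "abc.txt" lines in between.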
I would appreciate your input on resolving both of these issues.