Hi,
We are currently building a search application based on the web crawler, and we are having difficulty customizing the "crawl_config". This is what it currently looks like:
"crawl_config": {
"crawl_rules": [
{
"policy": "allow",
"url_pattern": "***",
"source": "***"
}
],
"max_crawl_depth": 10,
"seed_urls": [
"***"
],
"sitemap_urls": [],
"domain_allowlist": [
"***"
],
"deduplication_settings": [
{
"fields": [
"body_content",
"domains",
"headings",
"links",
"meta_description",
"meta_keywords",
"title"
],
"url_pattern": "***",
"source": "***"
}
],
"content_extraction_enabled": false,
"content_extraction_mime_types": []
}
In the raw JSON of our crawl we can see a lot of parameters, but the "crawl with custom settings" option only exposes a few of them. I assume I should change those settings through the API, but I was not able to find the related documentation or the objects that I need to change. The only documentation I found is the Web crawler API reference in the App Search documentation [8.13], and it also seems limited; for example, it does not address how to work with "content_extraction_mime_types".
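The closest thing I found while searching is that binary content extraction seems to be controlled at the deployment level in `enterprise-search.yml` rather than in the per-crawl config. This is just my understanding from the configuration docs, so please correct me if I'm wrong:

```yaml
# enterprise-search.yml (deployment-level settings, not per-crawl config)
connector.crawler.content_extraction.enabled: true
connector.crawler.content_extraction.mime_types: ["application/pdf", "application/msword"]
```

If that is the right mechanism, it is still unclear to me whether the per-crawl `content_extraction_enabled` / `content_extraction_mime_types` fields can override it through the API.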
I am new to this part of the elastic stack so any advice/help would be really appreciated.
Thank you very much.
Regards,
Martin