How to customize "crawl_config" parameters for the web crawler

Hi,
We are currently building a search application based on the web crawler.
We are having difficulty customizing the "crawl_config".

  "crawl_config": {
    "crawl_rules": [
      {
        "policy": "allow",
        "url_pattern": "***",
        "source": "***"
      }
    ],
    "max_crawl_depth": 10,
    "seed_urls": [
      "***"
    ],
    "sitemap_urls": [],
    "domain_allowlist": [
      "***"
    ],
    "deduplication_settings": [
      {
        "fields": [
          "body_content",
          "domains",
          "headings",
          "links",
          "meta_description",
          "meta_keywords",
          "title"
        ],
        "url_pattern": "***",
        "source": "***"
      }
    ],
    "content_extraction_enabled": false,
    "content_extraction_mime_types": []
  }

In the raw JSON of our crawl we can see many parameters, but the "Crawl with custom settings" option exposes only a few of them. I assume I should change these settings through the API, but I could not find the relevant documentation or the objects I would need to modify. The only documentation I found is the "Web crawler API reference" in the App Search documentation [8.13], which seems limited too. For example, it does not explain how to work with "content_extraction_mime_types".

I am new to this part of the Elastic Stack, so any advice or help would be really appreciated.
Thank you very much.
Regards,

Martin

Hi @Martin_BR,

Unfortunately, the Elastic Web Crawler does not have public APIs today. We're aware that this is a significant limitation, and we're working to improve this, so stay tuned in the coming months.

Today, the mime_types for content extraction can't be controlled crawl-to-crawl, or even index-to-index/engine-to-engine. Instead, they're set deployment-wide via a YAML configuration in enterprise-search.yml. See: Crawler Binary Content Extraction.
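
For anyone landing here later, here is a rough sketch of what that deployment-wide setting can look like. The two setting names are the ones I understand the Enterprise Search configuration docs to use for binary content extraction, so treat them as an assumption and verify them against the docs for your version. The MIME types listed are purely illustrative.

```yaml
# enterprise-search.yml
# Applies to the whole deployment, not to a single crawl or engine.
# Setting names are assumed from the Enterprise Search configuration docs;
# confirm them for your version before relying on this.
crawler.content_extraction.enabled: true

# Illustrative values only; list the MIME types you actually need extracted.
crawler.content_extraction.mime_types: ["application/pdf", "application/msword"]
```

Changes to this file generally only take effect after Enterprise Search is restarted.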

Thank you very much for your answer Sean!
I am looking forward to the future versions 🙂
This can be closed.
