Hi there,
I have some indexes created via the crawler in Elastic Cloud. These are automatically prefixed with search-.
I need to trigger a single-page (partial) crawl via an API call.
This API only appears to be available for engines:
(/api/as/v1/engines/[engine-name]/crawler/crawl_requests)
If I create an "app-search-managed-docs" engine type, this creates a new hidden index, ".ent-search-engine-documents-[engine-name]". The partial crawl API requests against this engine work, but then I have a hidden index and not the full crawler capability (no content extraction).
Is there a way forward here for an index-based engine? I would much prefer to have search-xxx indexes created and fed by the Elasticsearch web crawler. The only thing it doesn't offer is the partial crawl API call, and the documentation suggested that an index-based engine would provide that.
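For reference, the partial crawl call against an engine can be sketched roughly as below. This is a minimal sketch, not a verified client: the `overrides`, `seed_urls` and `max_crawl_depth` body fields are what I recall from the App Search crawler documentation and should be checked against your version, and the host, engine name, API key and page URL are placeholders. The function only builds the request; nothing is sent unless you pass it to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_partial_crawl_request(base_url, engine_name, api_key, page_urls):
    """Build (but do not send) a partial crawl request for an App Search engine.

    Assumption: the request body uses an "overrides" object with
    "seed_urls" and "max_crawl_depth"; verify against your Enterprise
    Search version before relying on it.
    """
    endpoint = (
        f"{base_url.rstrip('/')}/api/as/v1/engines/"
        f"{engine_name}/crawler/crawl_requests"
    )
    body = {
        "overrides": {
            # Only these pages are used as entry points for the crawl.
            "seed_urls": list(page_urls),
            # Depth 1 means the seed pages themselves, no followed links.
            "max_crawl_depth": 1,
        }
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    req = build_partial_crawl_request(
        "https://my-deployment.ent.example.com",
        "my-engine",
        "private-xxxx",
        ["https://www.example.com/docs/page-1"],
    )
    print(req.get_method(), req.full_url)
    # To actually trigger the crawl: urllib.request.urlopen(req)
```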
The web crawler for App Search will only work with App Search managed indices. Unfortunately, direct Elasticsearch indexes will not be able to work with it.
However, have you taken a look at the Elastic Open Web Crawler? Hopefully this will help with what you need to do. Although it has no API interface, you can control crawling via the CLI, and you may be able to extend the code to suit your needs.
The Open Web Crawler is probably the safer route for future-proofing your application, so I would try to use that, as there's no guarantee that the App Search crawler will add support for non-managed indices.
As for ingest pipelines, not directly - what are you looking to do with ingest pipelines in this context?
Ingest pipelines would be used to parse meta tags into document fields, as the crawler that comes with the App Search managed indices does not offer extraction rules to do the same.
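To make that concrete, here is a rough sketch of the kind of pipeline definition meant here, assuming the documents land in a regular Elasticsearch index where a pipeline can be attached. The source field names (`meta_description`, `meta_keywords`) and target field names (`summary`, `keywords`) are assumptions for illustration; use whatever fields your crawler actually writes. The code only builds the JSON body that would be sent to `PUT _ingest/pipeline/<name>`.

```python
import json


def build_meta_tag_pipeline(
    description_source="meta_description",  # assumed crawler field name
    keywords_source="meta_keywords",        # assumed crawler field name
    description_target="summary",           # hypothetical target field
    keywords_target="keywords",             # hypothetical target field
):
    """Return an ingest pipeline body that copies meta tag values into
    dedicated document fields."""
    return {
        "description": "Copy crawled meta tags into document fields",
        "processors": [
            {
                # Copy the meta description into its own field.
                "set": {
                    "field": description_target,
                    "copy_from": description_source,
                    # Skip quietly if the source field is absent.
                    "ignore_failure": True,
                }
            },
            {
                # Turn a comma-separated keywords string into an array.
                "split": {
                    "field": keywords_source,
                    "separator": ",\\s*",
                    "target_field": keywords_target,
                    "ignore_missing": True,
                }
            },
        ],
    }


if __name__ == "__main__":
    # Print the body for PUT _ingest/pipeline/<name>.
    print(json.dumps(build_meta_tag_pipeline(), indent=2))
```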