How to extract metadata using the Webcrawler

Marten · July 14, 2021, 2:39pm

Hi there,
I'm testing the App Search Webcrawler.
Is there a way to extract more metadata than the current standard ones?
The documentation (Web crawler reference, Elastic App Search Documentation [8.4]) mentions something about adding a template, but I can't find a way to implement this.
How can I enrich my documents with extra data without changing all my webpages?

Best Regards,

Marten

ross.bell · July 14, 2021, 3:26pm

Hey @Marten,

Could you link us to an example page you're crawling, and/or provide a snippet of the tags content from the crawled pages where you're trying to set up custom document attributes? The instructions you linked to are indeed the way to accomplish custom document attributes.

Could you also confirm the version of Enterprise Search you're running?

Thanks
Ross

Marten · July 15, 2021, 6:37am

Hi Ross,

Thanks for your quick reply.
The page I want to crawl is:

I'm using the hosted app search service on Elastic Cloud since yesterday, so I guess that it's the latest version.

Best Regards,

Marten

ross.bell · July 15, 2021, 3:55pm

Thanks for providing the example. The documentation you originally linked to describes the only approach currently supported. You will need to modify the crawled page(s) to include <meta ... > tags that the crawler will recognize and pick up as custom fields.

The good news is that we plan to introduce configurability to the crawler in the future that would not require introducing <meta> tags to your crawled content. However, I can't provide a date by which that would be available.
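
To make the <meta> tag approach concrete, here is a minimal sketch of how name/content pairs in a page's <head> can be read out as fields. It is not the crawler's own code; it uses Python's standard html.parser only to show which tag/value pairs would be visible on a page. The field name product_category and the sample page are made up for illustration, and the exact attributes the App Search crawler expects on a <meta> tag are defined in the crawler reference, not here.

```python
from html.parser import HTMLParser


class MetaFieldParser(HTMLParser):
    """Collects name/content pairs from <meta> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags that carry both a name and a content attribute count.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name")
        content = attr.get("content")
        if name and content is not None:
            self.fields[name] = content


def extract_meta_fields(html):
    """Return a dict of meta tag names mapped to their content values."""
    parser = MetaFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical page: product_category is an illustrative custom field name.
SAMPLE_PAGE = """<html><head>
<title>Example product</title>
<meta name="product_category" content="garden tools">
<meta name="description" content="A sturdy spade.">
</head><body><p>Hello</p></body></html>"""

if __name__ == "__main__":
    print(extract_meta_fields(SAMPLE_PAGE))
```

Running a check like this against your own pages is a quick way to confirm the tags are actually present in the served HTML before starting a crawl.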

Marten · July 16, 2021, 7:17am

Great, I got it now.
It seems that I just had to add an extra field to the schema with the name of the meta tag.
The crawler then picked it up automatically.
This wasn't entirely clear to me from the documentation, but it's clear to me now.

Thanks for your help,

Marten
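
For readers who want to script the schema step described above: the thread does not say whether the field was added through the dashboard or the API. The following is a sketch that builds (but does not send) a schema update request, assuming the App Search schema endpoint at /api/as/v1/engines/{engine}/schema accepts a JSON map of field names to types. The host, engine name, API key, and field name are all placeholders, and whether a given crawler-backed engine accepts schema changes this way should be checked against the App Search documentation for your version.

```python
import json
import urllib.request


def build_schema_update(base_url, engine, api_key, fields):
    """Build a POST request that adds fields to an App Search engine schema.

    The endpoint path is an assumption based on the App Search schema API;
    verify it against the documentation for your version.
    """
    url = f"{base_url.rstrip('/')}/api/as/v1/engines/{engine}/schema"
    body = json.dumps(fields).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


# All values below are placeholders, not real endpoints or credentials.
request = build_schema_update(
    "https://example.ent-search.invalid",
    "my-engine",
    "private-xxxx",
    {"product_category": "text"},  # field name matching the meta tag name
)

if __name__ == "__main__":
    # Sending is left to the reader: urllib.request.urlopen(request)
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

The field name in the schema is chosen to match the meta tag name, which is what the post above reports as the step that made the crawler pick the value up.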

system · August 13, 2021, 7:17am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
