AppSearch: Web Crawler - Indexing field with multiple values

StefanHeijden · June 6, 2023, 2:19pm

Hello,

We are using the web crawler to index pages from our websites. We are using some custom meta tags to add extra data to each document. We want to add a field "persons" which can contain zero or more persons. How can we add the meta tag(s) so that the web crawler fills the index correctly?

Ideally, the end result would be similar to how standard fields like "links" are filled.

chriscressman · June 6, 2023, 3:37pm

You can use specific meta tags or data attributes to extract custom fields. Use data attributes when the content is visible on the page or meta tags if the content is not visible.
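As a rough illustration of the meta tag approach, the sketch below renders a tag in the `class="elastic"` form that the App Search crawler reference describes for custom fields. The helper name `render_elastic_meta`, the field name `persons`, and the sample value are assumptions made for this example; check the linked reference for the exact markup your version expects.

```python
from html import escape


def render_elastic_meta(field_name: str, value: str) -> str:
    """Render one meta tag carrying a custom field for the crawler.

    Hypothetical helper: the tag shape follows the App Search crawler
    docs, but the function itself is only an illustration.
    """
    return '<meta class="elastic" name="{}" content="{}">'.format(
        escape(field_name, quote=True),
        escape(value, quote=True),
    )


# Emit the tag for a page related to a single person.
print(render_elastic_meta("persons", "Tom"))
# <meta class="elastic" name="persons" content="Tom">
```

Escaping the attribute values keeps names containing quotes or ampersands from breaking the page markup.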

Docs for App Search web crawler: Web crawler reference | App Search documentation [8.8] | Elastic

Docs for Elastic web crawler: Optimizing web content for the web crawler | Enterprise Search documentation [8.8] | Elastic

The Elastic web crawler also provides a content extraction rules UI. You can use this UI to define rules for extracting data into custom fields. For example, you can provide a CSS selector to extract the content for a specific field.

Extraction rules docs: Web crawler content extraction rules | Enterprise Search documentation [8.8] | Elastic

StefanHeijden · June 7, 2023, 7:14am

Thank you for your reply.

We are using App Search, not Enterprise Search. I am also familiar with the documentation about using meta tags to extract custom fields. My question is: how can I make sure the App Search web crawler extracts multiple values from a meta tag instead of treating it as a single value?

For example, a page can be related to multiple people. I want the index for this page to have a field with

persons : ["Tom","Sandra"]

How can I add this to the meta tag(s) on that page so that the web crawler fills the field this way? Is it even possible with the App Search web crawler?

chriscressman · June 7, 2023, 7:43pm

Two options I can think of:

1. Use the Elastic crawler to create your index and use the extraction rules UI to create the custom field (array values are supported).

Then create an App Search engine from that index: Create Elasticsearch index engines | App Search documentation [8.8] | Elastic

2. Use the App Search crawler meta tag to write multiple values into a single string.

Locate the Elasticsearch index for the App Search engine: Indices, engines, meta engines, and content sources | Enterprise Search documentation [master] | Elastic

Set up an index pipeline to split the string into multiple values: Ingest pipelines | Elasticsearch Guide [8.8] | Elastic
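A minimal sketch of what option 2 could look like, assuming the meta tag content is a comma-separated string such as "Tom, Sandra" in a field named `persons` (both are assumptions for illustration). The pipeline body uses the Elasticsearch `split` processor; `simulate_split` is a hypothetical local helper that only approximates the processor with Python's `re.split` so the separator can be checked without a cluster.

```python
import json
import re

# Assumed pipeline body: field name "persons" and the comma delimiter
# are illustrative choices, not taken from the thread.
PIPELINE = {
    "description": "Split a delimited persons string into an array",
    "processors": [
        {
            "split": {
                "field": "persons",
                # Regex separator: a comma with optional surrounding whitespace.
                "separator": "\\s*,\\s*",
                "ignore_missing": True,
            }
        }
    ],
}


def simulate_split(doc: dict, pipeline: dict = PIPELINE) -> dict:
    """Approximate the split processors locally (not the real ingest engine)."""
    result = dict(doc)
    for processor in pipeline["processors"]:
        cfg = processor["split"]
        value = result.get(cfg["field"])
        if value is None:
            if cfg.get("ignore_missing"):
                continue  # Leave documents without the field untouched.
            raise KeyError(cfg["field"])
        result[cfg["field"]] = re.split(cfg["separator"], value)
    return result


print(json.dumps(PIPELINE, indent=2))  # Body for the ingest pipeline API.
print(simulate_split({"persons": "Tom, Sandra"}))
# {'persons': ['Tom', 'Sandra']}
```

The printed JSON is the kind of body that would be sent to `PUT _ingest/pipeline/<pipeline-name>`. Python's regex engine and the one Elasticsearch uses differ in edge cases (trailing empty strings, for instance), so the pipeline itself is still worth testing with the simulate API against real documents.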

StefanHeijden · June 13, 2023, 9:19am

Thank you very much! Option 2 seems to be exactly what we want.
I have created the ingest pipeline and run tests with the documents we want to index. This works fine.
The only problem I still have is that I can't find a way to make sure the ingest pipeline runs whenever my engine crawls a site. How can I add an ingest pipeline to my engine (index)?

chriscressman · June 13, 2023, 2:52pm

Doh! I overlooked this aspect. This is again a difference between the two crawlers. The newer crawler supports ingest pipelines.

I'm struggling to come up with a solution for this using the App Search crawler.

system · July 11, 2023, 2:52pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
