App Search: Web Crawler - Indexing a field with multiple values

Hello,

We are using the web crawler to index pages from our websites, and we use some custom meta tags to add extra data to each document. We want to add a field "persons" which can contain zero or more persons. How can we add the meta tag(s) so that the web crawler fills the index correctly?

Ideally, the end result would be similar to how standard fields like "links" are filled.

You can use specific meta tags or data attributes to extract custom fields. Use data attributes when the content is visible on the page or meta tags if the content is not visible.
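As a rough illustration of the two markup styles, the sketch below parses a hypothetical page using the `<meta class="elastic" ...>` and `data-elastic-name` forms described in the App Search web crawler reference. The page content, the field names (`persons`, `page_title`) and the `CustomFieldParser` class are assumptions made for this example; it is not how the crawler itself is implemented.

```python
from html.parser import HTMLParser

# Hypothetical page. The meta tag carries hidden data, the data attribute
# marks content that is visible on the page.
SAMPLE_PAGE = """
<html>
  <head>
    <meta class="elastic" name="persons" content="Tom,Sandra">
  </head>
  <body>
    <h1 data-elastic-name="page_title">Team page</h1>
  </body>
</html>
"""


class CustomFieldParser(HTMLParser):
    """Toy extractor that collects custom fields from the two markup styles."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._open_field = None  # field name of the element currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("class") == "elastic" and "name" in attrs:
            # Meta tag style: the value comes from the content attribute.
            self.fields[attrs["name"]] = attrs.get("content", "")
        elif "data-elastic-name" in attrs:
            # Data attribute style: the value is the element's text.
            self._open_field = attrs["data-elastic-name"]

    def handle_data(self, data):
        if self._open_field and data.strip():
            self.fields[self._open_field] = data.strip()
            self._open_field = None


parser = CustomFieldParser()
parser.feed(SAMPLE_PAGE)
fields = parser.fields
print(fields)
```

Note that in this sketch a `content` value such as `Tom,Sandra` comes out as one string, not as a list of two values.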

Docs for App Search web crawler: Web crawler reference | App Search documentation [8.8] | Elastic

Docs for Elastic web crawler: Optimizing web content for the web crawler | Enterprise Search documentation [8.8] | Elastic

The Elastic web crawler also provides a content extraction rules UI. You can use this UI to define rules for extracting data into custom fields. For example, you can provide a CSS selector to extract the content for a specific field.

Extraction rules docs: Web crawler content extraction rules | Enterprise Search documentation [8.8] | Elastic

Thank you for your reply.

We are using App Search, not Enterprise Search. I am also familiar with the documentation about using meta tags to extract custom fields. My question is: how can I make sure the App Search web crawler extracts multiple values from a meta tag instead of treating it as a single value?

For example, a page can be related to multiple people. I want the indexed document for this page to have a field with

persons : ["Tom","Sandra"]

How can I add this to the meta tag(s) on that page so that the web crawler fills the field this way? Is it even possible with the App Search web crawler?

Two options I can think of:

1. Use the Elastic crawler to create your index and use the extraction rules UI to create the custom field (array values are supported).

Then create an App Search engine from that index: Create Elasticsearch index engines | App Search documentation [8.8] | Elastic

2. Use the App Search crawler meta tag to write multiple values into a single string.

Locate the Elasticsearch index for the App Search engine: Indices, engines, meta engines, and content sources | Enterprise Search documentation [master] | Elastic

Set up an ingest pipeline to split the string into multiple values: Ingest pipelines | Elasticsearch Guide [8.8] | Elastic
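A minimal sketch of what the splitting step in option 2 could look like, assuming the meta tag writes a comma-separated string such as "Tom, Sandra" into a field named `persons`. `PIPELINE_BODY` is a request body for the Elasticsearch ingest pipeline API using the `split` processor; the field name, the separator pattern and the `simulate_split` helper are assumptions for illustration. The helper only approximates the processor locally so the result can be inspected without a cluster.

```python
import json
import re

# Body for: PUT _ingest/pipeline/<your-pipeline-name>
# The field name "persons" and the separator pattern are assumptions.
PIPELINE_BODY = {
    "description": "Split the comma-separated persons string into an array",
    "processors": [
        {
            "split": {
                "field": "persons",
                "separator": r"\s*,\s*",   # regex: comma with optional whitespace
                "ignore_missing": True,    # pages without the field pass through
            }
        }
    ],
}


def simulate_split(doc, processor=PIPELINE_BODY["processors"][0]["split"]):
    """Local approximation of the split processor, for illustration only."""
    field = processor["field"]
    value = doc.get(field)
    if value is None:
        return dict(doc)  # mirrors ignore_missing: leave the document unchanged
    parts = re.split(processor["separator"], value)
    # Drop trailing empty strings, as the processor does by default.
    while parts and parts[-1] == "":
        parts.pop()
    return {**doc, field: parts}


print(json.dumps(PIPELINE_BODY, indent=2))
result = simulate_split({"title": "Team page", "persons": "Tom, Sandra"})
print(result)
```

Before relying on it, the actual pipeline can be checked against sample documents with the Elasticsearch simulate pipeline API.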

Thank you very much! Option 2 seems to be exactly what we want.
I have created the ingest pipeline and run tests with the documents we want to index. This works fine.
The only problem I still have is that I can't find a way to make sure the ingest pipeline runs whenever my engine crawls a site. How can I add an ingest pipeline to my engine (index)?

Doh! I overlooked this aspect. This is another difference between the two crawlers: the newer Elastic web crawler supports ingest pipelines.

I'm struggling to come up with a solution for this using the App Search crawler.
