We are using the web crawler to index pages from our websites, with some custom meta tags that add extra data to each document. We want to add a field "persons" that can contain zero or more persons. How can we add the meta tag(s) so that the web crawler fills the index correctly?
Ideally, the end result would be similar to how standard fields like "links" are filled.
You can use specific meta tags or data attributes to extract custom fields. Use data attributes when the content is visible on the page or meta tags if the content is not visible.
The Elastic web crawler also provides a content extraction rules UI. You can use this UI to define rules for extracting data into custom fields. For example, you can provide a CSS selector to extract the content for a specific field.
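As a rough illustration of the meta tag approach, the sketch below parses a sample page and collects the custom fields. It assumes the `class="elastic"` meta tag convention from the crawler documentation. The `ElasticMetaParser` class and the sample page are hypothetical, and this is not the crawler's actual implementation.

```python
from html.parser import HTMLParser

# Hypothetical sample page using the class="elastic" meta tag convention.
PAGE = """
<html><head>
  <meta class="elastic" name="persons" content="Tom,Sandra">
</head><body><h1>Team page</h1></body></html>
"""


class ElasticMetaParser(HTMLParser):
    """Collects name/content pairs from <meta class="elastic"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("class") == "elastic":
            self.fields[attrs.get("name")] = attrs.get("content")


parser = ElasticMetaParser()
parser.feed(PAGE)
fields = parser.fields
print(fields)
```

Note that the value arrives as one string, so a comma-separated list in `content` is not an array by itself.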
We are using App Search, not Enterprise Search. I am also familiar with the documentation about using meta tags to extract custom fields. My question is: how can I make sure the App Search web crawler extracts multiple values from a meta tag instead of treating it as a single value?
For example, a page can be related to multiple people. I want the index for this page to have a field like:
persons : ["Tom","Sandra"]
How can I add this to the meta tag(s) on that page so that the web crawler fills the field this way? Is it even possible with the App Search web crawler?
Thank you very much! Option 2 seems to be exactly what we want.
I have created the ingest pipeline and run tests with the documents we want to index. This works fine.
The only problem I still have is that I can't find a way to make sure the ingest pipeline runs whenever my engine crawls a site. How can I add an ingest pipeline to my engine (index)?
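For reference, a pipeline of the kind described here might look like the sketch below. It assumes the meta tag holds a comma-separated string and uses the Elasticsearch `split` processor to turn it into an array. The field name `persons` comes from the question; the description text and the `simulate_split` helper are hypothetical, and the helper only approximates the processor locally.

```python
import json
import re

# Pipeline body using the Elasticsearch `split` processor.
# The separator is a regular expression; this one also trims spaces around commas.
pipeline = {
    "description": "Split comma-separated persons into an array (hypothetical)",
    "processors": [
        {
            "split": {
                "field": "persons",
                "separator": r"\s*,\s*",
                "ignore_missing": True,
            }
        }
    ],
}


def simulate_split(doc, field, separator):
    """Local approximation of the split processor, for illustration only."""
    out = dict(doc)
    if field in out:
        out[field] = re.split(separator, out[field])
    return out


result = simulate_split({"persons": "Tom, Sandra"}, "persons", r"\s*,\s*")
print(json.dumps(pipeline, indent=2))
print(result)
```

The printed JSON is the kind of body that would be sent when creating the pipeline through the ingest pipeline API.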