AppSearch: Web Crawler - add custom field

AlanNggg · August 2, 2023, 9:35am

Hello, I am currently using the AppSearch Web Crawler and would like to add custom fields to the crawled documents.

Unfortunately, the websites that the web crawler is crawling do not allow the addition of meta tags for custom fields. As a workaround, I have created a proxy server that sits in front of the website domain, lets the crawler reach all web pages under it, and adds the custom fields, following the instructions in "Extract custom fields using web crawler and proxy" (App Search documentation, 8.9).
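For readers following along, here is a minimal sketch of the injection step such a proxy performs. It assumes the crawler picks up `<meta class="elastic" name="..." content="...">` tags in the page head as custom fields, as the linked documentation describes; the function name `inject_custom_fields` and the sample field are hypothetical, and the proxying itself (fetching the upstream page and serving the result) is left out.

```python
import re
from html import escape


def inject_custom_fields(html: str, fields: dict) -> str:
    """Insert one meta tag per custom field just before </head>.

    Assumption: the crawler reads <meta class="elastic"> tags as custom
    fields. Field names and values here are illustrative only.
    """
    tags = "".join(
        '<meta class="elastic" name="{}" content="{}">'.format(
            escape(str(name), quote=True), escape(str(value), quote=True)
        )
        for name, value in fields.items()
    )
    # Find the closing head tag regardless of case or stray whitespace.
    match = re.search(r"</head\s*>", html, re.IGNORECASE)
    if match is None:
        # No <head> section: return the page unchanged rather than guess.
        return html
    return html[: match.start()] + tags + html[match.start():]


page = "<html><head><title>T</title></head><body></body></html>"
print(inject_custom_fields(page, {"department": "sales"}))
```

A proxy built this way could in principle look up `fields` by the requested host, so one process serves several domains, but that routing is not shown here.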

However, the URL of each document the crawler generates points to the proxy server instead of the actual domain.

Additionally, I have about 5 domains, which means that I would need to create 5 proxy servers for adding custom fields. It has become cumbersome for me to manage all of these proxy servers.

Could you kindly provide me with some advice on this matter?

Furthermore, I have considered using Elastic Web Crawler, but it does not have an API, which I require for crawling websites.

Thanks.

joemcelroy · August 3, 2023, 8:59am

Hi there,

You could use an ingest pipeline to enrich the data before it's persisted into the Elasticsearch index. Using the script processor with a Painless script, you can add additional fields to each document.
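As a rough sketch of what such a pipeline definition could look like, the snippet below builds the JSON body for a pipeline with a single script processor. The field name `site_section` and its value are made up for illustration, and the body would normally be sent to Elasticsearch with a `PUT _ingest/pipeline/<name>` request; how the pipeline gets attached to a given index or engine is not covered here.

```python
import json

# Hypothetical pipeline body: field name and value are illustrative only.
pipeline = {
    "description": "Add a custom field to each crawled document (sketch)",
    "processors": [
        {
            "script": {
                "lang": "painless",
                # ctx is the document being ingested; params holds the
                # values passed in alongside the script.
                "source": "ctx['site_section'] = params['site_section'];",
                "params": {"site_section": "docs"},
            }
        }
    ],
}

# Print the JSON that would form the request body.
print(json.dumps(pipeline, indent=2))
```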

Hope this helps!

Joe

AlanNggg · August 3, 2023, 10:50am

Hi,

Thank you for your response. I tried using an ingest pipeline with the App Search Web Crawler, but unfortunately it didn't produce the desired results. I discovered that the App Search Web Crawler engine has its own dedicated ingest pipeline, which is designed specifically for binary documents and excludes HTML (see "Ingest pipeline for App Search document indexing"). It seems that custom fields cannot be added to HTML documents using this pipeline.

system · August 31, 2023, 10:50am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
