Hello, I am currently using the AppSearch Web Crawler and would like to add custom fields to the crawled documents.
Unfortunately, the websites that the web crawler is crawling do not allow the addition of meta tags for custom fields. As a solution, I have created a proxy server that targets the website domains to crawl all web pages under it and add custom fields, following the instructions on Extract custom fields using web crawler and proxy | App Search documentation [8.9] | Elastic.
However, the URL of each document the crawler generates points to the proxy server instead of the actual domain.
Additionally, I have about 5 domains, which means that I would need to create 5 proxy servers for adding custom fields. It has become cumbersome for me to manage all of these proxy servers.
Could you kindly provide me with some advice on this matter?
Furthermore, I have considered using Elastic Web Crawler, but it does not have an API, which I require for crawling websites.
You could use an ingest pipeline to enrich the data before it's persisted into the Elasticsearch index. Using the script processor with a Painless script, you can add additional fields to each document.
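As a rough illustration of that suggestion, here is a minimal Python sketch that builds the body of an ingest pipeline with a single script processor. The helper name `build_pipeline` and the field `department` are made up for the example, and it assumes the field names are plain identifiers without quotes. It only prints the JSON; it doesn't talk to a cluster.

```python
import json


def build_pipeline(custom_fields):
    """Build an ingest pipeline body whose script processor sets each custom field.

    custom_fields is a hypothetical mapping of field name -> value; the names
    are assumed to be simple identifiers (no quotes or special characters).
    """
    # One Painless assignment per field; values are passed in through params.
    source = " ".join(
        f"ctx['{name}'] = params['{name}'];" for name in custom_fields
    )
    return {
        "description": "Add custom fields to each incoming document",
        "processors": [
            {
                "script": {
                    "lang": "painless",
                    "source": source,
                    "params": dict(custom_fields),
                }
            }
        ],
    }


# Example values only; replace with whatever fields you need.
pipeline = build_pipeline({"department": "support"})
print(json.dumps(pipeline, indent=2))
```

The printed JSON is the kind of body you would send with `PUT _ingest/pipeline/<pipeline-name>`. Whether a given index or engine actually runs the pipeline depends on how documents are written to it, so treat this as a starting point rather than a confirmed fix.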
Thank you for your response. I tried using an ingest pipeline with the App Search Web Crawler, but unfortunately it didn't produce the desired results. I discovered that App Search engines have their own dedicated ingest pipeline, which is designed for binary documents and does not cover HTML (see Ingest pipeline for App Search document indexing). It seems that custom fields cannot be added to crawled HTML documents using this pipeline.