Is it possible to index "external" websites (websites we don't control) and avoid the header/footer sections, for instance? Perhaps by some trick with the web crawler, tuning, or some other avenue altogether?
The problem we want to solve, for example: when a search term is on a menu on a website, many irrelevant pages are returned because that word is on every header on the site.
I do see that it is possible to exclude sections on a website we control by adding the data-elastic-exclude attribute to an HTML tag, but this is for websites we do not control.
Background: We are trying to build a search box for our website that will search across about a dozen websites by using a meta engine on App Search. So far we've been able to index most of them using the web crawler, plus one engine that uses an index made with the Confluence Cloud connector on Workplace Search.
Great question. While the web crawler is primarily intended for crawling your own websites today, as long as you're happy to respect the website's robots.txt (we provide no way to override it, for ethical reasons), you can get creative in how you access the data in order to make use of robots meta tags and custom field meta tags.
The author's intent was a little different - he wanted to be able to add custom meta tags inside HTML that he didn't directly control. You, on the other hand, may want to add robots meta tags or the data-elastic-exclude attribute to exclude the headers and footers. But the same architecture would probably work for you: you can set up a proxy between the website and your crawler, and use that proxy to dynamically modify the HTML before it reaches the web crawler.
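To make the proxy idea a bit more concrete, here is a minimal sketch in Python using only the standard library. It is not an Elastic-provided tool: the `UPSTREAM` URL, the port, and the names `mark_excluded`, `RewritingProxy`, and `serve` are all made up for illustration, and it assumes the target site wraps its menus in `<header>`, `<footer>`, or `<nav>` tags. It rewrites tags with a regular expression, which is fragile on unusual markup, so a real HTML parser may be a better fit.

```python
import re
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical upstream site; replace with the site you want to crawl.
UPSTREAM = "https://example.com"

# Opening tags to hide from the crawler (an assumption about the site's markup).
# The negative lookahead skips tags that already carry the attribute.
_TAG_RE = re.compile(
    r"<(header|footer|nav)\b(?![^>]*\bdata-elastic-exclude\b)([^>]*)>",
    re.IGNORECASE,
)


def mark_excluded(html: str) -> str:
    """Add data-elastic-exclude to each <header>, <footer>, and <nav> opening tag."""
    return _TAG_RE.sub(r"<\1 data-elastic-exclude\2>", html)


class RewritingProxy(BaseHTTPRequestHandler):
    """Fetches each requested path from UPSTREAM and rewrites HTML responses."""

    def do_GET(self):
        # Upstream errors (4xx/5xx) raise here; a real proxy should handle them.
        with urllib.request.urlopen(UPSTREAM + self.path) as resp:
            content_type = resp.headers.get("Content-Type", "text/html")
            charset = resp.headers.get_content_charset() or "utf-8"
            body = resp.read()

        # Only HTML is modified; robots.txt and other files pass through untouched.
        if "text/html" in content_type:
            text = body.decode(charset, errors="replace")
            body = mark_excluded(text).encode(charset, errors="replace")

        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the proxy locally; the crawler is then pointed at this address."""
    HTTPServer(("127.0.0.1", port), RewritingProxy).serve_forever()
```

Calling `serve()` starts the proxy, and the crawler would then be given the proxy's address as its entry point. One thing to keep in mind with this approach is that indexed documents will carry the proxy's URLs rather than the original site's, so you would probably need to map them back before showing results.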
In future releases, we hope to have configs that would allow you to specify CSS selectors or XPath queries in order to facilitate this sort of thing. If you have a support relationship with Elastic, I'd encourage you to file an Enhancement Request that explains your use case, so we can be sure to support it more directly in the future.