Page not indexed if a content extraction rule's CSS selector references an element that is not part of the page

Hi all,

We are using content extraction rules with CSS selectors as described here: Web crawler content extraction rules | Enterprise Search documentation [8.11] | Elastic

We've found out that a page is NOT indexed if the element referenced in the rule does NOT exist in the page. That means the crawler is not very fault tolerant.

For example, we want to extract the canonical link from the page head into the string field displayurl, using the following selector (strictly speaking an XPath expression rather than a CSS selector): html/head/link[@rel="canonical"]/@href
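To make the expected behaviour concrete, here is a minimal sketch in plain Python. It is not how the crawler is implemented; the names extract_displayurl and build_document are hypothetical, and the head-only restriction of the selector above is left out for brevity. It only illustrates what we would like to happen: a page without the element is still indexed, just without the displayurl field.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching <link> is of interest.
        if tag != "link" or self.canonical_href is not None:
            return
        attributes = dict(attrs)
        if attributes.get("rel") == "canonical":
            self.canonical_href = attributes.get("href")


def extract_displayurl(html: str) -> Optional[str]:
    """Return the canonical URL, or None if the page has no such element."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    parser.close()
    return parser.canonical_href


def build_document(url: str, html: str) -> Dict[str, str]:
    """Hypothetical indexing step: a missing element omits the field,
    it does not drop the whole page."""
    document = {"url": url}
    displayurl = extract_displayurl(html)
    if displayurl is not None:
        document["displayurl"] = displayurl
    return document
```

With this behaviour, a page without a canonical link would produce a document containing only url, while a page with one would additionally carry displayurl.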

How can we extract information which is not available on each page?

Regards

Sebastian

Hi @sebastianboelling,

"We've found out that a page is NOT indexed if the element referenced in the rule does NOT exist in the page."

Am I right in assuming you don't see a document in the Elasticsearch index representing the page if the CSS selector returns an empty result?

Could you please provide a URL if it's public?

Also, could you please have a look at the crawler event logs for the pages that aren't indexed?
