Page not indexed if a content extraction rule with a CSS selector fails because the referenced element is not part of the page

sebastianboelling · November 7, 2023, 6:49pm

Hi all,

we are using content extraction rules with CSS selectors as described here: Web crawler content extraction rules | Enterprise Search documentation [8.11] | Elastic

We've found out that a page is NOT indexed if the element referenced in the rule does NOT exist in the page. In other words, the crawler is not very fault tolerant.

For example, we want to extract the canonical link from the page head into the string field displayurl, using the following selector (strictly speaking an XPath expression rather than a CSS selector): html/head/link[@rel="canonical"]/@href
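To make the desired behaviour concrete, here is a minimal sketch in Python (standard library only). It is not the crawler's implementation and does not use any Elastic API. The names CanonicalLinkParser and extract_display_url are hypothetical, and unlike the selector above the sketch does not check that the link element sits inside head. It only illustrates the fault-tolerant result we would expect: the field is filled when the element exists and left empty (or set to a fallback) when it does not, instead of the whole page being dropped.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> elements matter, and only the first canonical one.
        if tag != "link" or self.canonical_href is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="canonical nofollow".
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical_href = attributes.get("href")


def extract_display_url(html: str, fallback: Optional[str] = None) -> Optional[str]:
    """Return the canonical href, or `fallback` if the page has none.

    A missing element is treated as "no value", not as an error.
    """
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical_href or fallback


if __name__ == "__main__":
    page_with_link = (
        '<html><head><link rel="canonical" href="https://example.com/a">'
        "</head><body></body></html>"
    )
    page_without_link = "<html><head><title>x</title></head><body></body></html>"
    print(extract_display_url(page_with_link))
    print(extract_display_url(page_without_link, fallback="https://example.com/b"))
```

With this kind of handling, a page without a canonical link would still be indexed, just without a value from the page for displayurl.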

How can we extract information which is not available on each page?

Regards

Sebastian

video · November 22, 2023, 8:44pm

Hi @sebastianboelling,

"We've found out that a page is NOT indexed if the element referenced in the rule is NOT existing in the page."

Am I right in assuming you don't see a document in the Elasticsearch index representing the page if the CSS selector returns an empty result?

Could you please provide a URL if it's public?

Also, could you please have a look at the crawler event logs for the pages that aren't indexed?

system · December 20, 2023, 8:45pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
