We've found that a page is NOT indexed if the element referenced in an extraction rule does NOT exist on that page. In other words, the crawler is not very fault tolerant.
For example, we want to extract the canonical link from the page head into the string field displayurl, which is selected by the following XPath expression: html/head/link[@rel="canonical"]/@href
How can we extract information that is not available on every page, without the whole page being dropped from the index?
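To make the behaviour we are hoping for concrete, here is a minimal sketch in Python, using only the standard library and not the crawler's own rule engine. The names `CanonicalLinkParser`, `extract_displayurl`, and the `fallback` parameter are made up for illustration. The point is only that a missing element yields an empty or default value instead of rejecting the page.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element, if any."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> elements matter, and only the first canonical one is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, or be absent entirely.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr_map.get("href")


def extract_displayurl(html: str, fallback: Optional[str] = None) -> Optional[str]:
    """Return the canonical URL, or `fallback` when the page has none.

    Hypothetical helper: a missing element is treated as "no value",
    not as an error, so the page can still be indexed.
    """
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical if parser.canonical is not None else fallback
```

With this approach, a page without the canonical link would still be indexed, with displayurl left empty or set to a fallback such as the page's own URL.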