Extraction rules for meta tag without "elastic" using an XPath selector

creekinouryard · May 21, 2026, 7:05pm

We are using the latest version of Open Crawler and are having trouble extracting meta tag values.
We have an HTML page as follows:

<html lang="en-ca" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <!--<meta http-equiv=Content-Type content="text/html; charset=windows-1252">-->
    <meta charset="utf-8" />
    <title>Example Title </title>
    <meta name="dc.date" content="2017-06-23" />
    <meta name="topic_obj_id" content="752033de-8aaf-46a4-b92e-d7a9fd0e3fad" /> 
    <meta name="location_obj_id" content="Victoria" />
</head>
<body style="font-family:Calibri; font-size:12pt">
   Content Goes Here
</body>
</html>

We are trying to use extraction rules to get the fields called dc.date and location_obj_id, but they always come back empty ("").
The extraction rules we are using are below.

Is there something we are doing wrong?

We have tried many alternatives in the XPath selector, but they all return an empty field.

This is the current attempt.

extraction_rulesets:
    - rules:
      - action: "extract"
        field_name: "dc_date"
        selector: '/html/head/meta[@name="dc.date"]/@content'
        join_as: "string"
        source: "html"
      - action: "extract"
        field_name: "topic_obj_id"
        selector: '/html/head/meta[@name="topic_obj_id"]/@content'
        join_as: "string"
        source: "html"

Asmaa_Oufkir · May 29, 2026, 5:42pm

Always use the form //*[local-name()='meta' and @name='...']/@content in your Open Crawler extraction rules when dealing with XHTML or HTML5 documents containing xmlns.

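This is the most portable solution: it does not depend on any crawler-specific feature, and it works whether or not the document declares a namespace. When xmlns is present, a namespace-aware XPath engine reads an unprefixed step such as meta as "meta in no namespace", so a path like /html/head/meta can match nothing even though the elements are there.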
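
To illustrate the namespace point, here is a minimal sketch using Python with lxml (a third-party package). It parses two tiny, made-up documents as XML, one with the XHTML xmlns and one without, and runs both selector styles against each. The names WITH_NS, WITHOUT_NS, ABSOLUTE, LOCAL_NAME, and values are only for this illustration, and the result reflects how lxml's namespace-aware XML parser behaves; Open Crawler's own parser may treat the page differently.

```python
from lxml import etree  # third-party: pip install lxml

# Two minimal documents that differ only in the default namespace declaration.
WITH_NS = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    b'<meta name="dc.date" content="2017-06-23" />'
    b'</head><body /></html>'
)
WITHOUT_NS = WITH_NS.replace(b' xmlns="http://www.w3.org/1999/xhtml"', b'')

# The selector style from the question and the one suggested in the reply.
ABSOLUTE = '/html/head/meta[@name="dc.date"]/@content'
LOCAL_NAME = "//*[local-name()='meta' and @name='dc.date']/@content"


def values(markup, selector):
    """Parse markup as XML and return the selector's matches as plain strings."""
    return [str(v) for v in etree.fromstring(markup).xpath(selector)]


for label, markup in (("with xmlns", WITH_NS), ("without xmlns", WITHOUT_NS)):
    print(label,
          "| absolute:", values(markup, ABSOLUTE),
          "| local-name():", values(markup, LOCAL_NAME))
# with xmlns | absolute: [] | local-name(): ['2017-06-23']
# without xmlns | absolute: ['2017-06-23'] | local-name(): ['2017-06-23']
```

In this sketch the absolute path only matches when no namespace is declared, while the local-name() form matches in both cases.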

If you need to scale (millions of pages), it's best to test your XPath expressions against sample pages with a small script in Python (lxml) or Java (javax.xml.xpath), so that you validate your rules before deploying them to Open Crawler.
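
A pre-deployment check along those lines could look like the sketch below, again in Python with lxml. It is not part of Open Crawler: RULES is a hypothetical hand-copied list of field_name/selector pairs, SAMPLE_PAGE is a trimmed copy of the page from the question, and check_rules and empty_fields are illustrative helper names. The page is parsed as XML here, which is an assumption and may not match how the crawler parses HTML.

```python
from lxml import etree  # third-party: pip install lxml

# Trimmed copy of the page from the question (well-formed, so it parses as XML).
SAMPLE_PAGE = b"""<html lang="en-ca" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title>Example Title </title>
    <meta name="dc.date" content="2017-06-23" />
    <meta name="topic_obj_id" content="752033de-8aaf-46a4-b92e-d7a9fd0e3fad" />
    <meta name="location_obj_id" content="Victoria" />
</head>
<body>Content Goes Here</body>
</html>"""

# Hypothetical rules to validate; the last one reuses the absolute path
# from the question so the report shows a failing selector.
RULES = [
    {"field_name": "dc_date",
     "selector": "//*[local-name()='meta' and @name='dc.date']/@content"},
    {"field_name": "location_obj_id",
     "selector": "//*[local-name()='meta' and @name='location_obj_id']/@content"},
    {"field_name": "dc_date_absolute",
     "selector": '/html/head/meta[@name="dc.date"]/@content'},
]


def check_rules(markup, rules):
    """Run every selector against one page and map field_name -> matched values."""
    doc = etree.fromstring(markup)
    return {
        rule["field_name"]: [str(v) for v in doc.xpath(rule["selector"])]
        for rule in rules
    }


def empty_fields(report):
    """Return the field names whose selector matched nothing."""
    return [name for name, matches in report.items() if not matches]


report = check_rules(SAMPLE_PAGE, RULES)
for name, matches in report.items():
    print(f"{name}: {matches}")
print("selectors with no match:", empty_fields(report))
```

Running this over a handful of representative pages before a large crawl is a cheap way to catch selectors that silently return nothing.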
