We are using the latest version of Open Crawler and are having trouble extracting meta tag values.
We have an HTML page as follows:
<html lang="en-ca" xmlns=" http://www.w3.org/1999/xhtml ">
<head>
<!--<meta http-equiv=Content-Type content="text/html; charset=windows-1252">-->
<meta charset="utf-8" />
<title>Example Title </title>
<meta name="dc.date" content="2017-06-23" />
<meta name="topic_obj_id" content="752033de-8aaf-46a4-b92e-d7a9fd0e3fad" />
<meta name="location_obj_id" content="Victoria" />
</head>
<body style="font-family:Calibri; font-size:12pt">
Content Goes Here
</body>
</html>
We are trying to use extraction rules to get the fields called dc.date and location_obj_id, but they always come back as empty strings ("").
Extraction rules we are using are below.
Is there something we are doing wrong?
We have tried many alternatives in the XPath selector, but they all return an empty field.
This is the current attempt:
extraction_rulesets:
  - rules:
      - action: "extract"
        field_name: "dc_date"
        selector: /html/head/meta[@name="dc.date"]/@content
        join_as: "string"
        source: "html"
      - action: "extract"
        field_name: "topic_obj_id"
        selector: /html/head/meta[@name="topic_obj_id"]/@content
        join_as: "string"
        source: "html"
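
To rule out the page itself, a quick check outside the crawler could confirm that the meta values are present in the markup. The following is only a minimal sketch using Python's standard-library html.parser. The names MetaCollector, collect_meta and SAMPLE are hypothetical, the sample is a trimmed copy of the page above (without the xmlns and style attributes), and the script does not evaluate XPath or reproduce how Open Crawler parses the page.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Hypothetical helper: collects name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Skip meta tags without both attributes, e.g. <meta charset="utf-8" />.
        if "name" in attributes and "content" in attributes:
            self.meta[attributes["name"]] = attributes["content"]


def collect_meta(html_text):
    """Return a dict mapping meta name to meta content for the given HTML."""
    parser = MetaCollector()
    parser.feed(html_text)
    parser.close()
    return parser.meta


# Trimmed copy of the sample page from this post (assumption: the crawler
# receives equivalent markup).
SAMPLE = """<html lang="en-ca">
<head>
<!--<meta http-equiv=Content-Type content="text/html; charset=windows-1252">-->
<meta charset="utf-8" />
<title>Example Title </title>
<meta name="dc.date" content="2017-06-23" />
<meta name="topic_obj_id" content="752033de-8aaf-46a4-b92e-d7a9fd0e3fad" />
<meta name="location_obj_id" content="Victoria" />
</head>
<body>Content Goes Here</body>
</html>"""

if __name__ == "__main__":
    for name, content in collect_meta(SAMPLE).items():
        print(f"{name} = {content!r}")
```

If the three named meta tags are listed with their values, the markup is probably fine and the problem is more likely in the selector or in how the extraction rules are configured.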