I am trying to configure the crawler to extract content from a specific selector in the HTML documents (.content-container) into an ES field called my_content. When I run the crawler (v0.2.0) against ES (v8.14.3), I see body_content populated (with the full HTML), but there is no field called my_content in the ES document.
I would appreciate any help.
domains:
  - url: https://xxxxx.com # The base URL for this domain
    seed_urls:
      - https://xxxxxx..com/documentation
    extraction_rulesets:
      - url_filters:
          - type: begins
            pattern: /documentation
        rules:
          - action: extract
            type: extract_content
            selector: .content-container
            field_name: my_content
            join_as: string
            source: html
Hi @CobusT
First, I assume you're working from the main branch, as this feature is only present there. Do you know what git commit sha you're on?
Content extraction by CSS selectors was introduced over two big PRs, so it's possible you have only the first changes, which introduced the configuration and documentation updates. The actual implementation is from a later commit.
Can you try re-pulling main branch and seeing if that fixes the problem?
Regarding the config file itself, the only thing I notice that could be a problem is the key domains[].extraction_rulesets[].rules[].type. This key doesn't exist in the crawler config, but I don't think its presence should break anything.
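For reference, here is a rough sketch of how the same ruleset might look with the unrecognized `type` key removed from the rule. Everything else is copied from the config in the question, including the redacted URLs. The indentation is my assumption about the intended nesting, so treat this as a sketch rather than a verified config.

```yaml
domains:
  - url: https://xxxxx.com # The base URL for this domain
    seed_urls:
      - https://xxxxxx..com/documentation
    extraction_rulesets:
      - url_filters:
          - type: begins
            pattern: /documentation
        rules:
          # The rule-level `type: extract_content` key is dropped here;
          # the url_filters `type` key above is a separate setting and stays.
          - action: extract
            selector: .content-container
            field_name: my_content
            join_as: string
            source: html
```

Removing the key is only a tidy-up. As noted, it probably isn't what prevents my_content from being populated.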
If you've pulled the latest main branch and this still doesn't seem to work, can you share any logs from the crawl job?
We can't guarantee that the main branch has complete features as a lot of things are still in development. I think the latest main sha is fine as of today. However, as we add more work it might become unstable again.
If you want a more stable experience I'd recommend using the 0.1 branch. It won't have as many features but it should be more stable. If you want to keep using the newer features like CSS selectors for content extraction, I'd avoid pulling the main branch again until v0.2.0 is released.
I'll try to use feature branches going forward so this won't happen again.
Got it.
If you don’t mind a follow-up question… what is the overall timeline for a stable 0.2.0? I do need CSS selectors (very cool feature) and would like to know when we can start thinking of using this ‘for real’.
The current projection for v0.2.0 is mid-late August. We don't have a specific date, though. I can reply here again when it's released to let you know.
Also (just FYI), v0.2.0 will be the beta version, so it won't be a GA (generally available) product yet. This just impacts the support SLA, and also means we can't guarantee backwards compatibility between v0.2.0 and v1.0.0.
It will be promoted to GA when we release a v1.0.0, which we don't have a timeline for yet.