Can't get extraction rulesets working

I am trying to configure the crawler to extract content from a specific selector in the HTML documents (.content-container) into an ES field called my_content. When I run the crawler (v0.2.0) against ES (v8.14.3), I see body_content populated (with the full HTML), but there is no field called my_content in the ES document.

I would appreciate any help.

domains:
  - url: https://xxxxx.com  # The base URL for this domain
    seed_urls:
      - https://xxxxxx.com/documentation
    extraction_rulesets:
      - url_filters:
          - type: begins
            pattern: /documentation
        rules:
          - action: extract
            type: extract_content
            selector: .content-container
            field_name: my_content
            join_as: string
            source: html

Hi @CobusT
First, I assume you're working from the main branch, as this feature is only present there. Do you know which git commit SHA you're on?
Content extraction by CSS selectors was introduced over two big PRs, so it's possible you have only the first changes, which introduced the configuration and documentation updates. The actual implementation is from a later commit.
Can you try re-pulling main branch and seeing if that fixes the problem?

Regarding the config file itself, the only thing I notice that could be a problem is the setting domains[].extraction_rulesets[].rules.type. This setting doesn't exist, but I don't think its presence should break anything.
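
For reference, here is a minimal sketch of the same ruleset with the unrecognised `type` key dropped. It assumes the remaining keys (action, selector, field_name, join_as, source) are valid as posted, since the reply above flags nothing else. The fragment sits under a single domains[] entry, and the url and seed_urls lines are omitted here because they are unchanged.

```yaml
# Nested under one entry of `domains:` (url and seed_urls unchanged, omitted here)
extraction_rulesets:
  - url_filters:
      # Apply the rules only to paths that start with /documentation
      - type: begins
        pattern: /documentation
    rules:
      # `type: extract_content` removed: per the reply above, rules have no `type` setting
      - action: extract
        selector: .content-container   # CSS selector to pull content from
        field_name: my_content         # Target field in the ES document
        join_as: string
        source: html
```

Dropping the key is tidy-up only: as noted above, its presence was not expected to break anything, so this change alone would not explain a missing my_content field.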

If you've pulled the latest main branch and this still doesn't seem to work, can you share any logs from the crawl job?

Ha! Thanks @nfeekery. I was apparently on an older branch. It now works!

I'm glad to hear it's working!

We can't guarantee that the main branch has complete features, as a lot of things are still in development. I think the latest main SHA is fine as of today. However, as we add more work, it might become unstable again.

If you want a more stable experience I'd recommend using the 0.1 branch. It won't have as many features but it should be more stable. If you want to keep using the newer features like CSS selectors for content extraction, I'd avoid pulling the main branch again until v0.2.0 is released.

I'll try to use feature branches going forward so this won't happen again :slight_smile:

Got it.
If you don’t mind a follow-up question… what is the overall timeline for a stable 0.2.0? I do need CSS selectors (very cool feature) and would like to know when we can start thinking of using this ‘for real’.

The current projection for v0.2.0 is mid-to-late August. We don't have a specific date, though. I can reply here again when it's released to let you know.

Also (just FYI), v0.2.0 will be the beta version, so it won't be a GA (generally available) product yet. This just impacts the support SLA, and also means we can't guarantee backwards compatibility between v0.2.0 and v1.0.0.

It will be promoted to GA when we release a v1.0.0, which we don't have a timeline for yet.
