Can't get extraction rulesets working

CobusT · July 27, 2024, 12:56am

I am trying to configure the crawler to extract content matching a specific CSS selector in the HTML documents (.content-container) into an ES field called my_content. When I run the crawler (v0.2.0) against ES (v8.14.3), I see body_content populated (with the full HTML), but there is no field called my_content in the ES document.

I would appreciate any help.

domains:
  - url: https://xxxxx.com  # The base URL for this domain
    seed_urls:
      - https://xxxxxx.com/documentation
    extraction_rulesets:
      - url_filters:
          - type: begins
            pattern: /documentation
        rules:
          - action: extract
            type: extract_content
            selector: .content-container
            field_name: my_content
            join_as: string
            source: html

nfeekery · July 29, 2024, 1:49pm

Hi @CobusT
First, I assume you're working from the main branch, as this feature is only present there. Do you know which git commit SHA you're on?
Content extraction by CSS selectors was introduced over two big PRs, so it's possible you have only the first changes, which introduced the configuration and documentation updates. The actual implementation is from a later commit.
Can you try re-pulling the main branch and seeing if that fixes the problem?

Regarding the config file itself, the only thing I notice that could be a problem is domains[].extraction_rulesets[].rules[].type. This config option doesn't exist, but I don't think its presence should break anything.
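
For reference, here is a sketch of the same ruleset with the unrecognized type key dropped from the rule. Everything else is copied from the config posted above, including the redacted placeholder domain, so treat it as an illustration rather than a verified working file:

```yaml
domains:
  - url: https://xxxxx.com  # placeholder domain, redacted in the original post
    seed_urls:
      - https://xxxxx.com/documentation
    extraction_rulesets:
      - url_filters:
          - type: begins            # apply the rules only to matching URL paths
            pattern: /documentation
        rules:
          - action: extract         # extract content rather than set a fixed value
            selector: .content-container
            field_name: my_content  # name of the field written to the ES document
            join_as: string         # join multiple matches into one string
            source: html
```

Note that url_filters[].type is a separate key and stays. Only the type key under rules is removed.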

If you've pulled the latest main branch and this still doesn't seem to work, can you share any logs from the crawl job?

CobusT · July 29, 2024, 3:39pm

Ha! Thanks @nfeekery. I was apparently on an older branch. It now works!

nfeekery · July 30, 2024, 8:13am

I'm glad to hear it's working!

We can't guarantee that the main branch has complete features, as a lot of things are still in development. I think the latest main SHA is fine as of today. However, as we add more work, it might become unstable again.

If you want a more stable experience I'd recommend using the 0.1 branch. It won't have as many features but it should be more stable. If you want to keep using the newer features like CSS selectors for content extraction, I'd avoid pulling the main branch again until v0.2.0 is released.

I'll try to use feature branches going forward so this won't happen again.

CobusT · July 30, 2024, 2:45pm

Got it.
If you don’t mind a follow-up question… what is the overall timeline for a stable 0.2.0? I do need CSS selectors (very cool feature) and would like to know when we can start thinking of using this ‘for real’.

nfeekery · July 30, 2024, 2:56pm

The current projection for v0.2.0 is mid-late August. We don't have a specific date, though. I can reply here again when it's released to let you know.

Also (just FYI), v0.2.0 will be the beta version, so it won't be a GA (generally available) product yet. This just impacts the support SLA, and also means we can't guarantee backwards compatibility between v0.2.0 and v1.0.0.

It will be promoted to GA when we release a v1.0.0, which we don't have a timeline for yet.

system · August 27, 2024, 2:56pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
