Elastic Web crawler extraction rule support for excluding css selectors with :not

Hi,

We are using the Elastic Web Crawler and are trying to exclude content by using content extraction rules based on CSS selectors.

To be clear, we do not want a rule that selects content from the DOM into a field; we want to exclude certain content and send the "rest" to a field.

An example of such a selector is: main div:not(:has(nav,button))

That should exclude all nav and button tags below every div, but it does not work properly. It seems that the crawler does not support multiple selectors separated by a comma, e.g. (nav,button).

If we use a single selector, it works, e.g. main div:not(:has(button)). But we cannot define multiple rules instead, because rules do not work as a pipeline: a rule cannot use the output of a previous rule.

As mentioned in "Web crawler content extraction rules | Enterprise Search documentation [8.10] | Elastic", Elastic extraction rules support CSS Level 3, as described in "Selectors Level 3" (w3.org).

CSS Level 3 also supports groups of selectors, which in the end act as a "single selector".

The MDN page ":not() - CSS: Cascading Style Sheets | MDN" (mozilla.org) also describes that comma-separated selectors can be used:

  • You can negate several selectors at the same time. Example: :not(.foo, .bar) is equivalent to :not(.foo):not(.bar).

Therefore, :not(selector) and :has(selector) should also support selector lists such as :has(nav,button) or :not(nav,button).

Could you have a look into this?

Best regards

Sebastian

Hi Sebastian,

Did you try to chain the selectors like this?

main div:not(:has(nav)):not(:has(button))

The Elastic Web Crawler does support comma-separated selectors; however, it is not possible to pass more than one argument to the :has pseudo-class.
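
If many such selectors need converting, the chaining can be done mechanically before the rule is entered. The following is only a minimal Python sketch and not part of the crawler: expand_not_has is a hypothetical helper name, and it assumes the arguments of :has contain no nested parentheses. It relies on :not(:has(a, b)) matching the same elements as :not(:has(a)):not(:has(b)).

```python
import re

# Matches :not(:has(<args>)) where <args> has no nested parentheses (assumption).
_NOT_HAS = re.compile(r":not\(\s*:has\(([^()]*)\)\s*\)")


def expand_not_has(selector: str) -> str:
    """Rewrite :not(:has(a, b)) as :not(:has(a)):not(:has(b)).

    Hypothetical helper for illustration; selectors without a
    comma-separated :has argument list are returned unchanged.
    """

    def _chain(match: re.Match) -> str:
        # Split the argument list on commas and drop empty entries.
        parts = [p.strip() for p in match.group(1).split(",") if p.strip()]
        if not parts:
            return match.group(0)  # nothing to expand, keep the original text
        return "".join(f":not(:has({p}))" for p in parts)

    return _NOT_HAS.sub(_chain, selector)


print(expand_not_has("main div:not(:has(nav,button))"))
# prints: main div:not(:has(nav)):not(:has(button))
```

The output is the chained selector shown above, which can then be used in a single extraction rule.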
