Confluence Documents Extract

Hi, I have configured the Confluence connector and found that, for some documents, it extracts the HTML tags together with the body content, something like below:

Will this affect searching? How do we clean up the tags? Thanks,

Hi @aisyaharifin,

Are you using the Elastic connector? If so, are you running the connector natively on Cloud?

Hi @Artem_Shelkovnikov ,

Yes, I'm using the Elastic connector. I run the connector self-managed, and the deployment uses Docker.

Thanks!

You can try updating the connector configuration to add an HTML strip processor: HTML strip processor | Elasticsearch Guide [8.14] | Elastic.
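For illustration, here is a minimal sketch of what an ingest pipeline definition with an `html_strip` processor could look like, built as a plain Python dict. The field name `body` and the helper name `build_html_strip_pipeline` are assumptions made for this example; check the actual field name in your index mapping. The printed JSON is the kind of body you would send to `PUT _ingest/pipeline/<your-pipeline-name>`.

```python
import json


def build_html_strip_pipeline(field="body", target_field=None):
    """Return an ingest pipeline definition that strips HTML tags from `field`.

    `field` defaults to "body", which is an assumption about where the
    Confluence connector stores page content. If `target_field` is given,
    the cleaned text is written there and the original field is left as-is.
    """
    processor = {
        "field": field,
        # Skip documents that do not have the field instead of failing.
        "ignore_missing": True,
    }
    if target_field is not None:
        processor["target_field"] = target_field

    return {
        "description": "Strip HTML tags from Confluence body content",
        "processors": [{"html_strip": processor}],
    }


if __name__ == "__main__":
    # Print the JSON body for the pipeline definition.
    print(json.dumps(build_html_strip_pipeline(), indent=2))
```

Writing the cleaned text to a separate `target_field` keeps the raw value around, which can be handy while you are still comparing before and after.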

It also looks like a bug in the connector. You can file a bug using this link, and we'll prioritise and fix it accordingly!

Hi @Artem_Shelkovnikov ,

I have already included the HTML strip processor in the pipeline, and it works: the HTML tags are cleaned out now.

So this is unusual behavior, and you assume it's a bug? The Confluence connector is not supposed to extract the HTML tags along with the content, is it?

Thanks for your help!

Glad it works!

To me it indeed seems like a bug. I just wanted to clarify: does this field always contain HTML tags, or is it possible that this field contains only plain text in some cases?

If it's always or often HTML, then I'm 100% sure it's a bug that should be fixed in the connector.
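One rough way to answer the "always or only sometimes" question is to pull a sample of body values from the index and count how many contain something that looks like an HTML tag. The sketch below is only a heuristic under stated assumptions: the function names are made up for this example, the sample strings are invented, and the regex will not catch every form of markup.

```python
import re

# Matches things like <p>, </div>, <br/>, <a href="...">.
# A "<" followed by a space (as in "a < b") is intentionally not matched.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^>]*>")


def looks_like_html(text):
    """Return True if `text` contains at least one HTML-like tag."""
    return bool(TAG_PATTERN.search(text))


def html_fraction(bodies):
    """Return the fraction of strings in `bodies` that contain HTML-like tags."""
    if not bodies:
        return 0.0
    with_tags = sum(1 for body in bodies if looks_like_html(body))
    return with_tags / len(bodies)


if __name__ == "__main__":
    # Invented sample values, standing in for body content fetched from the index.
    sample = [
        "<p>Release notes for the <strong>Q3</strong> sprint</p>",
        "Plain text extracted from an attachment",
        "Threshold check: a < b and c > d",
    ]
    print(f"{html_fraction(sample):.2f} of sampled bodies contain HTML-like tags")
```

Grouping the same count by document type would show whether the tags appear only for a specific kind of record.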

I have filed an issue for the connectors: [Confluence] body content for some records includes HTML tags · Issue #2705 · elastic/connectors · GitHub

We will fix this problem in future releases :slight_smile:

Hi @Artem_Shelkovnikov ,

Thank you for looking into this.

What I've noticed is that all Confluence content that is created contains the HTML tags, except for attachments of type PDF, CSV, or PPT. For records of type "page", the HTML tags are extracted together with the content.


Thank you

Thank you for confirming. I've also mentioned in the bug issue that this happens for pages.
