Confluence Documents Extract

aisyaharifin · July 5, 2024, 7:25am

Hi, I have configured the Confluence connector and found that, for some documents, the extracted body content includes the HTML tags.

Will this affect search? How do we clean the tags? Thanks.

Artem_Shelkovnikov · July 5, 2024, 8:55am

Hi @aisyaharifin,

Are you using an Elastic connector? If so, are you running the connector natively on Cloud?

aisyaharifin · July 5, 2024, 10:01am

Hi @Artem_Shelkovnikov ,

Yes, I'm using an Elastic connector. I run the connector self-managed, and the deployment uses Docker.

Thanks!

Artem_Shelkovnikov · July 5, 2024, 3:48pm

You can try updating the connector configuration to add the HTML strip processor (see "HTML strip processor" in the Elasticsearch Guide [8.14]).
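As a rough illustration of the suggestion above, here is a minimal sketch in Python that builds an ingest pipeline definition containing a single `html_strip` processor. The field name `body` and the pipeline name `confluence-html-strip` are assumptions made for this example and are not taken from the thread; check the connector's index mapping for the actual field. The sketch only builds and prints the JSON body and does not contact a cluster.

```python
import json


def build_html_strip_pipeline(field: str = "body") -> dict:
    """Return an ingest pipeline body with one html_strip processor.

    `field` is the document field that holds the HTML (assumed to be
    "body" here; adjust it to match your index).
    """
    return {
        "description": "Strip HTML tags from Confluence body content",
        "processors": [
            {
                "html_strip": {
                    "field": field,
                    # Skip documents that have no such field instead of failing.
                    "ignore_missing": True,
                }
            }
        ],
    }


# Hypothetical pipeline name, used only for illustration.
PIPELINE_NAME = "confluence-html-strip"

pipeline = build_html_strip_pipeline("body")
print(f"PUT _ingest/pipeline/{PIPELINE_NAME}")
print(json.dumps(pipeline, indent=2))
```

The printed body could be sent to the `PUT _ingest/pipeline/<name>` endpoint. How the pipeline is then attached to the connector's index depends on your setup.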

It also looks like a bug in the connector. You can file a bug using this link, and we'll prioritise and fix it accordingly!

aisyaharifin · July 10, 2024, 7:24am

Hi @Artem_Shelkovnikov ,

I have already included the HTML strip processor in the pipeline and it works: the HTML tags are now cleaned.

So this is unexpected behavior and, as you suspect, a bug? The Confluence connector is not supposed to extract the HTML tags along with the content, is that right?

Thanks for your help!

Artem_Shelkovnikov · July 11, 2024, 9:01pm

Glad it works!

To me it does indeed seem like a bug. I just wanted to clarify: does this field always contain HTML tags, or is it possible that it contains only plain text in some cases?

If it's always or often HTML, then I'm 100% sure it's a bug that should be fixed in the connector.

Artem_Shelkovnikov · July 12, 2024, 12:43pm

I have filed an issue for the connectors: "[Confluence] body content for some records includes HTML tags", Issue #2705 in elastic/connectors on GitHub.

We will fix this problem in a future release.

aisyaharifin · July 15, 2024, 4:08am

Hi @Artem_Shelkovnikov ,

Thank you for looking into this.

What I noticed is that every Confluence document of type "page" is extracted with the HTML tags included, while attachments such as PDF, CSV, or PPT files are not.
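One way to check an observation like this across many documents is to count, per document type, how many bodies contain HTML tags. The sketch below is only an illustration in Python using the standard library's `html.parser`; the field names `type` and `body` and the sample documents are hypothetical and not taken from the connector's actual output.

```python
from collections import Counter
from html.parser import HTMLParser


class _TagDetector(HTMLParser):
    """Records whether at least one HTML start tag was seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        self.found = True


def contains_html(text: str) -> bool:
    """Return True if the text contains at least one HTML start tag."""
    detector = _TagDetector()
    detector.feed(text)
    detector.close()
    return detector.found


def html_counts_by_type(docs) -> dict:
    """Count documents whose body contains HTML, grouped by their type."""
    counts = Counter()
    for doc in docs:
        if contains_html(doc.get("body") or ""):
            counts[doc.get("type", "unknown")] += 1
    return dict(counts)


# Hypothetical sample documents, for illustration only.
sample_docs = [
    {"type": "page", "body": "<p>Release <strong>notes</strong></p>"},
    {"type": "attachment", "body": "Plain text extracted from a PDF"},
    {"type": "page", "body": "<h1>Title</h1>"},
]

print(html_counts_by_type(sample_docs))
```

With real data, the documents would come from a search against the connector's index rather than a hard-coded list.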

Thank you

Artem_Shelkovnikov · July 15, 2024, 9:11am

Thank you for confirming. I've also noted in the bug issue that this happens for pages.

