URLs behind text in Excel cells concatenated during ingestion using Ingest Attachment Processor

mbrb · October 21, 2024, 3:59pm

Hi everyone,

I'm using the Ingest Attachment Processor in Elasticsearch to parse and index various document types, including Excel files. The text within the cells is parsed correctly, and it appears separated in the index as expected. However, I'm running into an issue with hyperlinks in the Excel cells.

When a cell has a hyperlink embedded behind its display text (e.g., a word linked to a URL), the URLs appear to be concatenated into one long string with no separator during ingestion. The result is a single, continuous chain of URLs in the extracted content, which causes problems at index time because the resulting term is excessively long.
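
To illustrate the shape of the problem, here is a minimal Python sketch. The URLs are made up, and `split_urls` is a hypothetical helper, not something the Ingest Attachment Processor or Tika provides. It only shows that a run-together string like this could be split again on scheme boundaries as a post-processing workaround, assuming every URL starts with `http://` or `https://`.

```python
import re
from typing import List

# Hypothetical extracted text: three made-up URLs run together with no separator,
# similar to what ends up in the indexed content.
concatenated = "https://example.com/ahttps://example.com/bhttp://example.org/c"

# Zero-width split point in front of every "http://" or "https://".
# Splitting on empty matches requires Python 3.7 or newer.
SCHEME_BOUNDARY = re.compile(r"(?=https?://)")


def split_urls(text: str) -> List[str]:
    """Split a run of concatenated URLs back into individual URLs."""
    # The split yields an empty leading piece when the text starts with a URL,
    # so empty pieces are dropped.
    return [part for part in SCHEME_BOUNDARY.split(text) if part]


urls = split_urls(concatenated)
print(urls)
```

One caveat with this approach: a URL that itself contains another `http://` (for example in a redirect query parameter) would be split in the wrong place, so this is only a rough workaround and not a fix for the extraction itself.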

I suspect this might have something to do with how Apache Tika, which is used by the Ingest Attachment Processor, handles hyperlinks embedded in the text, but I’m not sure if that’s the root cause.

My questions are:

1. Why might URLs behind text in Excel cells be concatenated during ingestion?
2. Is there a way to adjust Elasticsearch or Tika configurations to ensure that URLs and text are treated separately during parsing and indexing?

Looking forward to any insights or suggestions!

Thanks in advance!
