Hi everyone,
I'm using the Ingest Attachment Processor in Elasticsearch to parse and index various document types, including Excel files. The text within the cells is parsed correctly, and it appears separated in the index as expected. However, I'm running into an issue with hyperlinks in the Excel cells.
When a hyperlink is embedded behind a term (e.g., a word in a cell that links to a URL), the URLs seem to be concatenated into one long string, with no separator between them, during ingestion. The result is a single continuous chain of URLs in the extracted content, which causes problems at index time because the resulting term is excessively long.
I suspect this has to do with how Apache Tika, which the Ingest Attachment Processor uses, handles hyperlinks embedded in the text, but I’m not sure that’s the root cause.
My questions are:
- Why might URLs behind text in Excel cells be concatenated during ingestion?
- Is there a way to adjust Elasticsearch or Tika configurations to ensure that URLs and text are treated separately during parsing and indexing?
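To make the second question concrete, here is a minimal Python sketch of a possible client-side fallback if nothing can be changed on the Elasticsearch or Tika side. It is only an illustration under assumptions: the helper name `split_concatenated_urls` and the sample URLs are hypothetical, every URL is assumed to start with `http://` or `https://`, and no URL is assumed to contain another scheme inside it (for example in a query parameter).

```python
import re

# Zero-width lookahead: matches the position right before each "http://" or
# "https://", so splitting there keeps the scheme attached to its URL.
_URL_START = re.compile(r"(?=https?://)")


def split_concatenated_urls(chain: str) -> list[str]:
    """Split a run of URLs that were glued together without separators.

    Hypothetical helper. Assumes each URL begins with http:// or https://.
    Any text before the first URL is dropped.
    """
    parts = _URL_START.split(chain)
    # Keep only the pieces that actually start with a scheme; this discards
    # the empty leading piece and any non-URL prefix.
    return [p for p in parts if p.startswith(("http://", "https://"))]


if __name__ == "__main__":
    sample = "https://a.example/xhttp://b.example/y"
    for url in split_concatenated_urls(sample):
        print(url)
```

This only separates the chain after extraction; it does not explain why the URLs are concatenated in the first place, which is what I'd really like to understand.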
Looking forward to any insights or suggestions!
Thanks in advance!