Request to extract metadata fields in the self-hosted content extraction service

Hi,

I'm using the self-hosted content extraction service to ingest office documents larger than 10 MB via the Elastic S3 connector, which we need for a client project.

I notice that only the document content is extracted into the event field "body", and not metadata fields like "title", "description", "author", etc. When using the Elasticsearch Attachment processor instead, these fields are extracted, but that method is limited to documents of 10 MB or less.

Looking into the data-extraction-service Docker container and the connector code, I find that metadata is in fact extracted (I verified this by running a local Apache Tika of the same 2.8.0 version), but the script /app/lua/tika-response-body.lua passes only the X-TIKA:content field to the client, where it is assigned to the "body" field by source.py, line 814.
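
For reference, here is roughly how I checked this against my local Tika Server (a minimal sketch; the port, file name, and metadata keys are just what my local setup used, so adjust for yours):

```python
import requests

# Send a document to a local Tika Server (assumed to listen on :9998) via the
# /rmeta/text endpoint, which returns JSON containing both metadata and content.
with open("example.docx", "rb") as f:
    resp = requests.put(
        "http://localhost:9998/rmeta/text",
        data=f,
        headers={"Accept": "application/json"},
    )

for doc in resp.json():
    # The extracted text is under "X-TIKA:content"; metadata such as
    # "dc:title" and "dc:creator" are sibling keys in the same object.
    print(doc.get("dc:title"), doc.get("dc:creator"))
    print((doc.get("X-TIKA:content") or "")[:200])
```

So the metadata is there in the Tika response; it just doesn't survive the Lua filter.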

It looks straightforward to include these metadata fields, as is done for the Elasticsearch Attachment processor case. Is this planned for the near term?

Thanks,

Jan Stap

Hi @jan.stap

Thanks for your question!
The Data Extraction Service is still in beta, because we're still trying to gather feedback on how it should behave. On one hand, we want it to be feature rich, on another hand, we don't want to absorb the support burden of 100% of the Tika Server APIs and features. If you're a customer, I'd definitely ask you to work with Support to file an Enhancement Request so that we can capture your use case and plan for it. Alternatively, if you're interested, we can try to connect you directly with one of our product managers (in a less public thread, of course).

Hi Sean,

Thanks! For now we will use Apache Tika directly to ingest office documents into the Elastic Stack, but I'll file the enhancement request with Support to help you plan further development of the Data Extraction Service.

Kind regards,
Jan


Do note: if your office documents are under 100 MB, you can use the Attachment Processor in an Elasticsearch ingest pipeline. It can return a lot more (if not all) of the metadata fields.
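
If it helps, here's a minimal sketch of what that can look like with the Python client (the pipeline name, index name, and field list are purely illustrative, not what the connector creates for you):

```python
import base64
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create an ingest pipeline that runs the Attachment processor over a
# base64-encoded "data" field and keeps the content plus some metadata.
es.ingest.put_pipeline(
    id="office-docs-attachment",
    processors=[
        {
            "attachment": {
                "field": "data",
                "properties": ["content", "title", "author", "date", "content_type"],
                "remove_binary": True,
            }
        }
    ],
)

# Index a document through the pipeline; extracted fields land under "attachment.*".
with open("example.docx", "rb") as f:
    es.index(
        index="office-docs",
        pipeline="office-docs-attachment",
        document={"data": base64.b64encode(f.read()).decode()},
    )
```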


Hi Sean,

Aha, thanks for the info! So you are saying (as I understand it) that the 10 MB limit is specific to the Elastic connector?

We do have some documents over 100 MB though. For the <=10MB case we already created an @custom pipeline on the search index, fetching the metadata fields from the Attachment processor output. The default pipeline only passes the .content field of the result.

Thanks,
Jan

@jan.stap yes, the 10 MB limit is/was hardcoded in the connector codebase. In the most recent branches this has been made configurable; the 10 MB limit was a legacy holdover that no longer had a valid justification. If you're on an older branch, feel free to fork the repo and modify this constant to increase the limit up to ~100 MB: connectors/connectors/source.py at 8.12 · elastic/connectors · GitHub
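
Purely as an illustration of the kind of change I mean (the actual constant name and location depend on your branch, so check source.py rather than copying this verbatim):

```python
# connectors/source.py -- illustrative only; verify the real constant on your branch.
# The idea is simply to raise the per-file cutoff the connector applies before
# downloading and extracting content. Keep it at or below http.max_content_length.
FILE_SIZE_LIMIT = 100 * 1024 * 1024  # ~100 MB instead of the legacy 10 MB
```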

We do have some documents over 100 MB though

The big fish always do. :slight_smile: This is where Tika or the Data Extraction Service shines, because the 100 MB limit comes from Elasticsearch's http.max_content_length default value, which we strongly recommend you not change. It limits the payload size of Elasticsearch requests, so the Attachment Processor can't receive documents larger than that. So you'd need to either drop such files (often, files this large only hurt search relevance anyway) or use a tool that extracts content at the edge before sending it to Elasticsearch.

The default pipeline only passes the .content field of the result.

Yeah, this was an intentional tradeoff. We figured it was better not to index more fields (which increases customer costs) by default, and to make that something you opt into. With the @custom workaround, hopefully you saw that the metadata fields are available for you to move/copy/process, since they're generated before your @custom pipeline runs and removed after it.
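
For instance, a @custom pipeline can copy a couple of those fields into the document before they're cleaned up. A sketch only: the pipeline id and especially the source field names below are assumptions, so check what your default pipeline's attachment step actually produces rather than trusting mine.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Sketch of an "<index-name>@custom" pipeline that copies metadata produced by
# the default pipeline's attachment step into top-level fields before the
# default pipeline removes the intermediate object.
es.ingest.put_pipeline(
    id="search-office-docs@custom",  # hypothetical index name
    processors=[
        {
            "set": {
                "field": "title",
                "copy_from": "_extracted_attachment.title",  # assumed source field
                "ignore_empty_value": True,
            }
        },
        {
            "set": {
                "field": "author",
                "copy_from": "_extracted_attachment.author",  # assumed source field
                "ignore_empty_value": True,
            }
        },
    ],
)
```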

@jan.stap I wanted to let you know that we just made the repo for the Data Extraction Service public.

Since you'd invested the effort to reverse-engineer the project, I figured you might be glad to know that you now have easier access to the code and can open GitHub issues/feature requests.

Hi Sean,

Thanks! Sorry for the late reply; I was away for the Easter holiday.

Cheers, Jan