Index the entire file content

Sudharsanam · January 20, 2021, 7:19am

Does App Search have a datatype called "Attachments", similar to Elasticsearch?

In Elasticsearch this is handled by installing the Ingest Attachment plugin, which uses Apache Tika for content extraction.

If App Search had this, it would be helpful for indexing the entire file content.

constancecchen · January 20, 2021, 6:30pm

Heya! App Search currently does not have anything out of the box to ingest files for you - you'd have to set up your own intermediary system of content extraction and send that data/JSON to App Search.
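As a rough illustration of such an intermediary, here is a minimal Python sketch that turns a file into a JSON document and builds a request for the App Search documents endpoint. The host, engine name, API key, and the `title`/`body` field names are all placeholders, and `extract_text` only reads plain-text files; a real setup would swap in an extractor such as Apache Tika for PDF, DOC, and similar formats.

```python
import json
import urllib.request
from pathlib import Path

# Assumptions (not from this thread): placeholder host, engine name, and API key.
APP_SEARCH_HOST = "http://localhost:3002"
ENGINE_NAME = "files"
PRIVATE_API_KEY = "private-xxxxxxxx"


def extract_text(path: Path) -> str:
    """Placeholder extractor: handles plain-text files only.

    Replace with a real content extractor (e.g. Apache Tika) for binary formats.
    """
    return path.read_text(encoding="utf-8", errors="replace")


def build_document(path: Path) -> dict:
    """Turn one file into one document; the field names here are hypothetical."""
    return {
        "id": path.name,
        "title": path.stem,
        "body": extract_text(path),
    }


def build_request(documents: list) -> urllib.request.Request:
    """Build (but do not send) a POST to the engine's documents endpoint."""
    url = f"{APP_SEARCH_HOST}/api/as/v1/engines/{ENGINE_NAME}/documents"
    payload = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {PRIVATE_API_KEY}",
        },
    )
```

Sending would then be something like `urllib.request.urlopen(build_request([build_document(Path("notes.txt"))]))`, with error handling and batching added to suit your setup.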

Alternatively, Workplace Search is capable of ingesting files and might be worth checking out as an option. Also CCing @nickchow on this one, feel free to chime in if you have any other alternative solutions for Sudharsanam!

Sudharsanam · January 22, 2021, 9:47am

Thank you, Constance Chen.

Setting up our own intermediary system for content extraction works. However, there is also a hard limit of 10 MB on the payload sent to App Search:
https://www.elastic.co/guide/en/app-search/current/limits.html

Is there a way to index files that are much larger than that in App Search?

Subhasis_Dash · January 25, 2021, 10:19am

Hi @nickchow @constancecchen,
If we index one million documents, each with 10 MB of extracted content, in a single App Search engine, will it be able to handle that without crashing?
Also, for Workplace Search, please share any documentation on how we can ingest files.

Regards,
Subhasis Dash

orhantoy · January 25, 2021, 3:39pm

@Subhasis_Dash Sounds like you have experience doing something like this with Elasticsearch. How was the performance for the scenario with 1M documents having 10MB of content?

The ability to fine-tune the mapping and settings would probably let you achieve a decent search experience with Elasticsearch, while the challenge with App Search becomes that you don't get much control over the mapping.

Subhasis_Dash · January 27, 2021, 6:20am

Sorry, but I have no experience using Elasticsearch.
The primary objective is to index the text content of files in App Search, with every file being one document in App Search.
We have to do this for millions of files. This does not necessarily mean that every document will be 10 MB in size.
Is App Search a suitable tool for this?

constancecchen · January 28, 2021, 11:34pm

@Subhasis_Dash - In theory App Search is capable of this. If you're self-hosting on Elastic Cloud or another service, all that's required is for you to scale up your server specs/size/nodes etc. until it can support the level you're looking for. I personally don't have experience using App Search at the million-document scale, although I know we have 1 or 2 customers who have this number of documents.

@Sudharsanam - Unfortunately I believe the 10MB API payload limit is currently a hard-coded cap and is not a configurable limit. My suggestion for now would be breaking your larger 10MB+ documents up into whatever equivalent of chapters you have, if possible.
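One possible way to do that splitting, sketched in Python under a few assumptions: text is split on blank-line paragraph boundaries as a stand-in for "chapters", the 8 MB chunk budget is an arbitrary margin below the 10 MB cap, and the `id`/`file_id`/`part`/`body` field names are hypothetical.

```python
import json

# Assumption: keep raw text well below the 10 MB payload cap to leave room
# for JSON overhead. The exact margin is a guess, not a documented value.
MAX_CHUNK_BYTES = 8 * 1024 * 1024


def split_text(text: str, max_bytes: int = MAX_CHUNK_BYTES) -> list:
    """Group paragraphs into chunks of at most max_bytes (UTF-8 encoded)."""
    chunks, current, current_size = [], [], 0
    for paragraph in text.split("\n\n"):
        # Count 2 extra bytes per paragraph for the "\n\n" separator.
        size = len(paragraph.encode("utf-8")) + 2
        if size > max_bytes:
            raise ValueError("a single paragraph exceeds max_bytes; split it further")
        if current and current_size + size > max_bytes:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(paragraph)
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def to_documents(file_id: str, text: str, max_bytes: int = MAX_CHUNK_BYTES) -> list:
    """One document per chunk; file_id and part let you regroup results later."""
    return [
        {"id": f"{file_id}_part_{i}", "file_id": file_id, "part": i, "body": chunk}
        for i, chunk in enumerate(split_text(text, max_bytes), start=1)
    ]


def payload_bytes(document: dict) -> int:
    """Size of the serialized JSON, which can exceed the raw text size."""
    return len(json.dumps(document).encode("utf-8"))
```

Since JSON escaping can inflate the payload, it is worth checking `payload_bytes(doc)` before sending each document. The limits page linked earlier in the thread lists other limits as well, so check those against your chunk size too.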

system · February 25, 2021, 11:34pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
