Index the entire file content

Does App Search have a data type called "Attachments", similar to Elasticsearch?

In Elasticsearch this is handled by installing the Ingest Attachment plugin, which uses Apache Tika for content extraction.

If App Search has this, it would be helpful for indexing the entire file content.

Heya! App Search currently does not have anything out of the box to ingest files for you - you'd have to set up your own intermediary system of content extraction and send that data/JSON to App Search.
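As a rough illustration of such an intermediary step, here is a minimal Python sketch. It assumes the files are already plain text (a real pipeline would call an extraction tool such as Apache Tika inside `extract_text`), and the field names `title` and `body`, the ID scheme, and the helper names are all hypothetical. The endpoint path follows the App Search documents API as I understand it; please verify it against the docs for your version.

```python
import json
import urllib.request
from pathlib import Path


def extract_text(path: Path) -> str:
    """Placeholder extractor: only handles plain-text files.

    A real pipeline would run a content-extraction tool here instead.
    """
    return path.read_text(encoding="utf-8", errors="replace")


def build_document(path: Path) -> dict:
    """Turn one file into one flat JSON document (field names are assumptions)."""
    return {
        "id": path.name,
        "title": path.stem,
        "body": extract_text(path),
    }


def build_index_request(base_url: str, engine: str, api_key: str, documents: list):
    """Prepare, but do not send, a request that indexes the given documents."""
    url = f"{base_url}/api/as/v1/engines/{engine}/documents"
    payload = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
```

Sending is then a matter of passing the prepared request to `urllib.request.urlopen(...)`, with whatever error handling and batching your setup needs.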

Alternatively, Workplace Search is capable of ingesting files and might be worth checking out as an option. Also CCing @nickchow on this one, feel free to chime in if you have any other alternative solutions for Sudharsanam!

Thank you, Constance Chen.

Setting up our own intermediary system for content extraction works. However, there is also a hard limit of 10 MB on the payload sent to App Search.
https://www.elastic.co/guide/en/app-search/current/limits.html

Is there a way to index much larger files in App Search?

Hi @nickchow @constancecchen,
If we index one million documents, each with 10 MB of extracted content, in a single App Search engine, will it be able to support that without crashing?
Also, for Workplace Search, please share any documentation on how we can ingest files.

Regards,
Subhasis Dash

@Subhasis_Dash Sounds like you have experience doing something like this with Elasticsearch. How was the performance for the scenario with 1M documents having 10MB of content?

The ability to fine-tune the mapping and settings would probably let you achieve a decent search experience with Elasticsearch, while the challenge with App Search becomes that you don't get much control over the mapping.

Sorry, but I have no experience using Elasticsearch.
The primary objective is to index the files' text content in App Search. In this case, every file is a document in App Search.
We have to do this for millions of files. This does not necessarily mean that all documents will be 10 MB in size.
Is App Search a suitable tool for achieving the above?

@Subhasis_Dash - In theory App Search is capable of this. If you're self-hosting, on Elastic Cloud or another service, all that's required is for you to scale up your server specs/size/nodes etc. until it can support the level you're looking for. I personally don't have experience using App Search at the million-document scale, although I know we have 1 or 2 customers who have this number of documents.

@Sudharsanam - Unfortunately I believe the 10MB API payload limit is currently a hard-coded cap and is not configurable. My suggestion for now would be breaking your larger 10MB+ documents up into whatever equivalent of chapters you have, if possible.
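If your files have no natural chapters, one way to approximate that is to split the extracted text by size. The sketch below is only an illustration: it breaks at blank lines where it can, falls back to a hard cut for a single oversized paragraph, and emits one document per piece. The `parent_id`, `part`, and `body` fields and the `-part-N` ID scheme are hypothetical names, not anything App Search requires.

```python
def split_by_bytes(text: str, max_bytes: int) -> list[str]:
    """Split text into pieces whose UTF-8 encoding is at most max_bytes each."""
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 (longest UTF-8 character)")
    chunks = []
    current = ""
    for para in text.split("\n\n"):
        candidate = para if not current else current + "\n\n" + para
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        data = para.encode("utf-8")
        while len(data) > max_bytes:
            cut = max_bytes
            # Step back so the cut never lands inside a multi-byte character.
            while (data[cut] & 0xC0) == 0x80:
                cut -= 1
            chunks.append(data[:cut].decode("utf-8"))
            data = data[cut:]
        current = data.decode("utf-8")
    if current:
        chunks.append(current)
    return chunks


def chunk_documents(doc_id: str, text: str, max_bytes: int) -> list[dict]:
    """Build one document per chunk, each pointing back to the original file."""
    return [
        {"id": f"{doc_id}-part-{i}", "parent_id": doc_id, "part": i, "body": chunk}
        for i, chunk in enumerate(split_by_bytes(text, max_bytes), start=1)
    ]
```

Keep `max_bytes` well below the payload cap, since JSON escaping and the other fields add overhead, and check the limits page linked earlier in the thread for any per-document limits that also apply. Keeping a `parent_id` on each part lets you group the parts back to the original file at search time.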
