I'm new to Elastic Enterprise Search and I have a question:
If the source data contains pictures/images, does Elastic ingest the data including the pictures/images and store them in the search engine's shards, or does it store only links to the pictures/images as blobs? If the blobs are stored/indexed on shards, they would take up a lot of space. Is there any documentation that covers this?
It depends on which feature you're using to ingest data with Enterprise Search.
If you are using Workplace Search Content Sources, see: Content extraction | Workplace Search Guide [8.3] | Elastic
Workplace Search makes every effort to process binary files. Files that are "text-like" (office documents, PDFs, HTML, etc.) have their text extracted. In addition, an attempt is made to generate thumbnail images for office documents and image documents. Rather than store the full-sized image, we store only two small copies of the thumbnail, which saves significant space. This feature can be disabled if space is still a concern, though doing so removes thumbnails from the default search experience. Other than the thumbnail images, no binary content is persisted in Elasticsearch. We do index links to the original document (image or otherwise).
If you are using the App Search Web Crawler, see: Web crawler reference | Elastic App Search Documentation [8.3] | Elastic. In the App Search Web Crawler, we attempt to extract text from binary documents, similar to what is attempted in Workplace Search. However, we do not generate thumbnails today, nor do we persist binary content for any files. We do index a link to the original document, image or otherwise.
If you're using any of our ingestion APIs (Elasticsearch Index API, App Search documents API, Workplace Search Custom Source API, etc.), then we'll index whatever data you send us. Most folks do not index binary documents, but instead process binaries on their end before sending data to our APIs for ingestion.
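To illustrate the "process binaries on your end" approach, here is a minimal Python sketch that describes an image by its metadata and a link rather than embedding its bytes. The function name `build_image_document`, the field names, and the `base_url` parameter are all hypothetical and chosen for illustration; they are not part of any Elastic API. It assumes the images are already hosted somewhere reachable under `base_url`.

```python
import hashlib
import mimetypes
from pathlib import Path


def build_image_document(path, base_url):
    """Describe an image by metadata and a link, without embedding its bytes.

    Field names here are illustrative assumptions, not an Elastic schema.
    """
    p = Path(path)
    data = p.read_bytes()  # read locally only to compute size and checksum
    mime, _ = mimetypes.guess_type(p.name)
    return {
        "title": p.stem,
        "file_name": p.name,
        "mime_type": mime or "application/octet-stream",
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Only the link is indexed; the binary stays in your own storage.
        "url": base_url.rstrip("/") + "/" + p.name,
    }
```

The resulting dictionary is plain JSON-serializable data, so it could be sent to whichever ingestion API you use. Since the image bytes are never part of the document, the index only grows by the size of the metadata.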