Understand elastic enterprise search from Business/ product perspective

ranjanchoudhary · August 17, 2023, 6:07pm

Can someone explain, step by step, what happens to a Word doc file if I sync my enterprise OneDrive with Elastic Enterprise Search?
Will it read all the text and images of the Word doc? Will it create a new index or document for the Word file? Will it save a replica of the original file?

Sean_Story · August 17, 2023, 7:47pm

Hi @ranjanchoudhary ,

Good questions. The answers depend somewhat on which version of Enterprise Search you're using. In the latest versions, Workplace Search will:

1. Hit the Microsoft Graph API to request metadata for the file. This includes name, author, edited dates, file size, URL, etc.
2. Send this metadata to Elasticsearch to be indexed.
3. Separately, kick off a subprocess to process the binary content of the file. This will short-circuit if the file is too large or of an unsupported type. The binary contents of the file are downloaded to an in-memory buffer (hence the size limitations).
4. Pass this buffer to Apache Tika to do text extraction. Assuming that it's a well-formed Word doc, the output of this should be similar to what you'd get if you select-all-copy-pasted the contents of your Word document into a plain-text editor. The words will be there, but much of the formatting (including embedded images) will be removed.
5. Clean up this plain text: squish whitespace and remove Unicode replacement characters (U+FFFD).
6. Upsert the resulting text into the same document, in the same Elasticsearch index, that the metadata was sent to previously.
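
To make the flow above concrete, here is a minimal Python sketch of the metadata-then-text pattern. It is not the Workplace Search implementation: the index is a plain dict standing in for Elasticsearch, and the names `MAX_BYTES`, `SUPPORTED_EXTENSIONS`, `should_extract`, `clean_text`, and `upsert`, along with the size limit and the field names, are all assumptions made up for illustration.

```python
import re

# Hypothetical limits; the real size limit and supported types are not stated here.
MAX_BYTES = 10 * 1024 * 1024
SUPPORTED_EXTENSIONS = (".doc", ".docx", ".pdf")


def should_extract(name: str, size_bytes: int) -> bool:
    """Short-circuit: skip files that are too large or of an unsupported type."""
    return size_bytes <= MAX_BYTES and name.lower().endswith(SUPPORTED_EXTENSIONS)


def clean_text(raw: str) -> str:
    """Drop U+FFFD replacement characters and collapse runs of whitespace."""
    without_replacement_chars = raw.replace("\ufffd", "")
    return re.sub(r"\s+", " ", without_replacement_chars).strip()


def upsert(index: dict, doc_id: str, fields: dict) -> None:
    """Merge fields into the document with this id, creating it if absent."""
    index.setdefault(doc_id, {}).update(fields)


# In-memory stand-in for a content source's single Elasticsearch index.
index = {}

# The metadata is indexed first (field names are hypothetical).
upsert(index, "file-1", {"name": "report.docx", "size": 12345})

# The extracted text lands later, in the same document.
# This string stands in for what a text extractor such as Tika might return.
extracted = "Quarterly  report\n\n\tRevenue \ufffd grew."
if should_extract("report.docx", 12345):
    upsert(index, "file-1", {"body": clean_text(extracted)})
```

After both calls the index still holds one document for the file, carrying the metadata fields and the cleaned body side by side.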

To answer each of your specific questions:

Will it read all the text?

Yes.

Will it read all the images?

No, image data is dropped.

Will it create a new index?

No, all of a content source's data goes to the same index.

Will it create a new document?

Yes, each Word doc will correspond to a single Elasticsearch document. This Elasticsearch document contains both the text of the Word doc and the metadata related to it.

Will it save a replica of the original file?

No. Elasticsearch is primarily for search, not for storage. And binary data is not particularly valuable for search. Instead, it's expected that you'd pivot from the search result's URL to the binary document in OneDrive.
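
As a rough illustration of that pivot, a search hit might be handled like the Python sketch below. The `_id` and `_source` keys follow the usual shape of an Elasticsearch hit, but the field names inside `_source`, the example URL, and the helper `link_to_original` are assumptions for illustration, not the actual Workplace Search schema.

```python
# A hypothetical search hit: extracted text and metadata live in one document,
# and the binary file itself is not stored.
hit = {
    "_id": "file-1",
    "_source": {
        "title": "report.docx",
        "url": "https://example.invalid/onedrive/report.docx",
        "body": "Quarterly report Revenue grew.",
    },
}


def link_to_original(search_hit: dict) -> str:
    """Return the URL a user would follow to open the original file in OneDrive."""
    return search_hit["_source"]["url"]


print(link_to_original(hit))
```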

Hope this helps!
Also, be on the lookout for release notes regarding an Elastic Connector for OneDrive, which should show up in our offering in an upcoming release. The Elastic Connector Framework has a few differences from Workplace Search which give you more control over how your data is processed, and specifically could let you keep the document binary in Elasticsearch if you really wanted to.
