Can someone explain, step by step, what happens to a Word doc file if I sync my enterprise OneDrive with Elastic Enterprise Search?
Will it read all the text and images of the Word doc? Will it create a new index or document for the Word file? Will it save a replica of the original file?
Hi @ranjanchoudhary ,
Good questions. The answers depend somewhat on which version of Enterprise Search you're using. In the latest versions, Workplace Search will:
- hit the Microsoft Graph API to request metadata for the file. This includes name, author, edited dates, file size, URL, etc.
- send this metadata to Elasticsearch to be indexed
- separately, it will kick off a subprocess to process the binary content of the file. This will short-circuit if the file is too large or of an unsupported type. The binary contents of the file are downloaded to an in-memory buffer (hence the size limitations)
- This buffer is passed to Apache Tika to do text extraction. Assuming that it's a well-formed Word doc, the output of this should be similar to what you'd get if you selected all, copied, and pasted the contents of your Word document into a plain-text editor. The words will be there, but much of the formatting (including embedded images) will be removed.
- This plain text is then cleaned up: whitespace is squished and bad Unicode replacement characters are removed.
- The resulting text is then upserted into the same document in the same Elasticsearch index as the metadata was sent to previously.
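The steps above can be sketched roughly in Python. This is not Workplace Search's actual code, just an illustration of the short-circuit, the text cleanup, and the upsert into the same document. The names `MAX_BYTES`, `SUPPORTED_EXTENSIONS`, `should_extract`, `clean_text`, and `upsert` are all made up for this example, the size limit is an arbitrary placeholder, and a plain dict stands in for the Elasticsearch index.

```python
import re

# Hypothetical values: the real size limit and supported types are not stated in this thread.
MAX_BYTES = 10 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".doc", ".docx", ".pdf", ".txt"}


def should_extract(name: str, size: int) -> bool:
    """Short-circuit: skip files that are too large or of an unsupported type."""
    dot = name.rfind(".")
    ext = name[dot:].lower() if dot != -1 else ""
    return size <= MAX_BYTES and ext in SUPPORTED_EXTENSIONS


def clean_text(raw: str) -> str:
    """Remove U+FFFD replacement characters, then squish runs of whitespace."""
    without_bad = raw.replace("\ufffd", "")
    return re.sub(r"\s+", " ", without_bad).strip()


def upsert(index: dict, doc_id: str, fields: dict) -> None:
    """Merge fields into the document with this id, creating it if absent."""
    index.setdefault(doc_id, {}).update(fields)


# A dict standing in for the content source's index.
index = {}

# Step 1: metadata is indexed first.
upsert(index, "file-1", {"name": "report.docx", "size": 2048})

# Step 2: extracted text (here a hard-coded stand-in for Tika output)
# is cleaned and upserted into the same document.
if should_extract("report.docx", 2048):
    extracted = "Quarterly\n\n  report\ufffd \t draft"
    upsert(index, "file-1", {"body": clean_text(extracted)})
```

After both steps, `index["file-1"]` holds the metadata fields and the cleaned `body` together, which is the "same document, same index" behaviour described in the last bullet.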
To answer each of your specific questions:
Will it read all the text?
Yes.
Will it read all the images?
No, image data is dropped.
Will it create a new index?
No, all of a content source's data goes to the same index.
Will it create a new document?
Yes, each Word doc will correspond to a single Elasticsearch document. This Elasticsearch document contains both the text of the Word doc and the metadata related to it.
Will it save a replica of the original file?
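As a rough picture of what that single document might look like when it comes back as a search hit, here is a small Python sketch. `_id` and `_source` are the usual keys of an Elasticsearch hit, but the field names inside `_source`, the values, and the `summarize` helper are invented for illustration; the real Workplace Search schema may differ.

```python
# Hypothetical field names and values; the real schema may differ.
word_doc_hit = {
    "_id": "onedrive-file-1",
    "_source": {
        # metadata from the Microsoft Graph API
        "name": "report.docx",
        "author": "A. Author",
        "size": 2048,
        "url": "https://example.invalid/onedrive/report.docx",
        # plain text extracted from the file
        "body": "Quarterly report draft",
    },
}


def summarize(hit: dict) -> str:
    """Build a one-line summary from the metadata and text of one hit."""
    src = hit["_source"]
    return f'{src["name"]} by {src["author"]}: {src["body"][:40]}'


summary = summarize(word_doc_hit)
```

Note that there is a `url` pointing back at the original file, but no field carrying the file's bytes.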
No. Elasticsearch is primarily for search, not for storage. And binary data is not particularly valuable for search. Instead, it's expected that you'd pivot from the search result's URL to the binary document in OneDrive.
Hope this helps!
Also, be on the lookout for release notes regarding an Elastic Connector for OneDrive, which should show up in our offering in an upcoming release. The Elastic Connector Framework has a few differences from Workplace Search which give you more control over how your data is processed, and specifically could let you keep the document binary in Elasticsearch if you really wanted to.