I'm new to Elastic and started a test in Elastic Cloud with Workplace Search in combination with MS OneDrive. The goal is to use full text search on PDF and Office files (DOCX, PPTX, XLSX).
The installation on Azure and linking my OneDrive as an organization source worked fine.
I'm able to search and find all documents based on their metadata, but not based on their content.
From the installation documentation I can't see how to activate full text search.
Can somebody point me in the right direction please?
According to the documentation, the file content should be indexed as well:
The OneDrive connector provided with Workplace Search automatically captures, syncs and indexes the following items:
Stored Files Including ID, File Metadata, File Content, Updated by, and timestamps
Do you have an example of content that you cannot find using the search bar?
Thanks for the quick response.
An example is this PDF-file: Flex-Silverlight-HTML5.pdf which contains the word "whitepaper".
The search results are shown as follows:
Sounds like a bug then. @Sean_Story WDYT?
Thanks for linking the example PDF-file, @SloMot, that really helps narrow things down.
There are a few reasons that you might not be getting full text content for a particular file.
(1) Full text extraction is only enabled for certain file types (doc, docx, html, odt, one, pdf, txt, pptx, rtf, xls, xlsx). Your example file is a well-formed PDF, so that isn't the issue for this file.
(2) The source document cannot be larger than 20 MB. Your sample file is 1.3 MB, so that's not the issue.
(3) The resulting text will be truncated if it's over 100 KB. Your sample file extracts to 41 KB, so that's not the issue.
(4) The extractors work in two phases. The first phase retrieves just metadata. The second phase extracts full text content and thumbnails where possible. Has the second "Full Sync" completed? You can look in the "details" page for your content source, under "Recent Activity", to see what the sync status is. If the sync is still running or has failed, that would explain why you're not seeing full text.
If it's not (4), I'd recommend looking through your logs to see if you can find errors or stack traces that might give you clues. But I had no problems indexing that sample PDF.
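For checking many files at once, the first three conditions above can be sketched as a small local pre-check. This is only an illustration: the function and constant names are hypothetical, it does not call any Workplace Search API, and it assumes "20 MB" and "100 KB" mean binary units (1024-based), which the connector may or may not use.

```python
# Hypothetical local pre-check mirroring the limits listed above.
# It does not talk to Workplace Search; it only applies the three stated rules.

# File types for which full text extraction is enabled (from the list above).
EXTRACTABLE_EXTENSIONS = {
    "doc", "docx", "html", "odt", "one", "pdf", "txt", "pptx", "rtf", "xls", "xlsx",
}
MAX_SOURCE_BYTES = 20 * 1024 * 1024  # assumption: 20 MB in binary units
MAX_TEXT_BYTES = 100 * 1024          # assumption: 100 KB in binary units


def extraction_issues(filename, size_bytes, extracted_text_bytes=None):
    """Return a list of reasons why full text might be missing or incomplete."""
    issues = []
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in EXTRACTABLE_EXTENSIONS:
        issues.append(f"file type '{ext}' is not extracted")
    if size_bytes > MAX_SOURCE_BYTES:
        issues.append("source document is larger than 20 MB")
    if extracted_text_bytes is not None and extracted_text_bytes > MAX_TEXT_BYTES:
        issues.append("extracted text is over 100 KB and will be truncated")
    return issues


# The sample PDF from this thread: 1.3 MB source, about 41 KB of extracted text.
print(extraction_issues("Flex-Silverlight-HTML5.pdf", int(1.3 * 1024 * 1024), 41 * 1024))
```

An empty list means none of the first three conditions apply, which leaves the sync status (4) and the logs as the remaining things to check.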
The file was added about 24 hours ago, and there have been several completed syncs since then.
So we can rule (4) out.
Where can I find the logs? I'm running on Elastic Cloud #2701152434 as a 7-day trial.
I started to look into this and Sean beat me to the punch before I got back to my computer. Just adding a data point: when I tried testing with that specific PDF, at first I was able to reproduce what SloMot saw: I could pull up the document if I searched by title, but not by "whitepaper". After giving it a few minutes to make sure the full sync completed, I was then able to search by "whitepaper":
I know you said this was over 24 hours for you, so you're not seeing exactly what I'm seeing. Are you able to search for other documents by contents, and this PDF file is an exception, or do no docs match on contents?
Also, are you on 7.10 yet? If not, it's easy to upgrade within the Cloud console and always worth a shot. Then maybe try removing and re-connecting OneDrive. Of note, I tested with Dropbox in case there's some very unexpected difference between the two sources.
Hi, the answers to your questions are:
(1) There are hundreds of documents (PDF and MS Office formats) in that OneDrive which can be found by metadata, but not by content. So it's a general problem, not only with this example file.
Additional info: Document-level permissions are Disabled for this source.
(2) The login screen says "Elastic Enterprise Search Version 7.10.0".
I will try removing and re-connecting the OneDrive today and give you feedback.
Can you tell me where I can find any logging information, so I can look for errors?
Problem solved - I removed and re-connected the OneDrive and voila - everything works fine.
Thanks for your help!
Glad to hear it!
Just to answer your last question:
Since you're using Elastic Cloud, the logs for Workplace Search aren't currently easily accessible, but you can file a support request to get at them, or ask the support team to help you track down issues. If you have any other problems during your trial, don't hesitate to open a support request!