Dec 5th, 2020: [EN] Searching anything, anywhere with Workplace Search

Introduction

You already know that Workplace Search comes with a lot of connectors that help you connect your enterprise data sources and search across all that information in a federated way. But what if a specific data source is not supported yet?

This post will cover how you can create a custom data source to send your own data. We'll also cover an example of how this was used in the community FSCrawler project.

Workplace Search Custom API

The easiest way to get started is to deploy Enterprise Search on Elastic Cloud. But if you want to run everything locally, you can also check out the Enterprise Search 7.x - Docker Compose example post, which shows how to run the whole stack with Docker.

Once everything has started, open your Enterprise Search URL ([URL]), launch Workplace Search, and open the Sources menu. Choose the Custom API Source and give it a name. For example, Local files.

Once it's created, you will be able to get the Access Token ([ACCESS_TOKEN]) and the Key ([KEY]) from the following screen:

Everything is ready, so we can now start sending content to be indexed in Workplace Search using the Custom API:

curl -XPOST [URL]/api/ws/v1/sources/[KEY]/documents/bulk_create \
-H "Authorization: Bearer [ACCESS_TOKEN]" \
-H "Content-Type: application/json" \
-d '[
  {
    "id" : 1,
    "title" : "Elasticsearch - The Definitive Guide",
    "body" : "Whether you need full-text search or real-time analytics of structured data -- or both -- the Elasticsearch distributed search engine is an ideal way to put your data to work. This practical guide not only shows you how to search, analyze, and explore data with Elasticsearch, but also helps you deal with the complexities of human language, geolocation, and relationships."
  }
]'

You should see:

{"results":[{"id":"1","errors":[]}]}

If you look at the Local files source again, you should see that a document has been indexed.

You can verify this by opening the default search application.

Indexing your local hard drive

What if we want to index our local drive into Workplace Search? We need to perform the following tasks:

  • Crawl the folder and its subfolders.
  • For each file, extract the metadata and the content.
  • When enough content is held in memory, flush it to Workplace Search, using the REST Custom API we just saw (a minimal sketch follows this list).
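Here is a minimal Java sketch of those three steps, under some loud assumptions: the LocalDriveCrawler class and FLUSH_SIZE constant are hypothetical, it only reads plain-text files (a real crawler would use something like Apache Tika for binary formats), and it simply clears the buffer where a real implementation would call the bulk Custom API:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

public class LocalDriveCrawler {

    private static final int FLUSH_SIZE = 100; // hypothetical buffer size

    public static void main(String[] args) throws IOException {
        List<Map<String, Object>> buffer = new ArrayList<>();
        // Step 1: crawl the folder and its subfolders
        try (Stream<Path> paths = Files.walk(Paths.get("/mydocuments"))) {
            paths.filter(Files::isRegularFile).forEach(path -> {
                // Step 2: extract metadata and content (plain text only here)
                Map<String, Object> doc = new HashMap<>();
                doc.put("id", path.toString());
                doc.put("title", path.getFileName().toString());
                try {
                    doc.put("created_at", Files.getLastModifiedTime(path).toString());
                    doc.put("body", Files.readString(path));
                } catch (IOException e) {
                    return; // skip unreadable files
                }
                buffer.add(doc);
                // Step 3: flush to Workplace Search when the buffer is full,
                // e.g. with the bulk() method shown later in this post
                if (buffer.size() >= FLUSH_SIZE) {
                    buffer.clear();
                }
            });
        }
        // a real implementation would also flush whatever is left in the buffer
    }
}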

In 2011, I started to write a project which does the first two steps and then sends the content to Elasticsearch. But this project was missing a UI.
With Workplace Search available from the Basic license at no cost, it became clear that the UI for the FSCrawler project would be Workplace Search.

So I started to write a PR for FSCrawler which adds this Workplace Search integration.

The current branch (at the time of writing) adds:

  • A Workplace Search Java Client as there is no official Java Client available yet.
  • The integration between FSCrawler and this client.

Note: this is still a WIP branch. If you want to try it out, you can build the project yourself or download the binary which is linked in the pull request.

Workplace Search Java Client

To build the Java Client, I used the Jersey Client library.

If you have a List of Maps, where each Map of keys and values represents a JSON document, you can simply send this list of documents to Workplace Search by calling a Java method like this one:

import java.util.List;
import java.util.Map;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;
import javax.ws.rs.core.MediaType;

// Send a list of documents to the Custom API bulk_create endpoint
public void bulk(String host, String accessToken, String key, List<Map<String, Object>> documents) {
    ClientBuilder.newClient()
            .target(host + "/api/ws/v1/")
            .path("sources/" + key + "/documents/bulk_create")
            .request(MediaType.APPLICATION_JSON)
            .header("Content-Type", "application/json")
            .header("X-Swiftype-Client", "elastic-workplace-search-java")
            .header("X-Swiftype-Client-Version", "7.10.0")
            .header("Authorization", "Bearer " + accessToken)
            .post(Entity.entity(documents, MediaType.APPLICATION_JSON), String.class);
}
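
For example, a hypothetical call sending the same book document we indexed earlier with curl (using the same placeholder values) could look like:

List<Map<String, Object>> documents = List.of(
        Map.of("id", "1",
               "title", "Elasticsearch - The Definitive Guide",
               "body", "Whether you need full-text search or real-time analytics of structured data -- or both -- ..."));
bulk("[URL]", "[ACCESS_TOKEN]", "[KEY]", documents);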

Of course, the client I built can do a bit more than that and provides some advanced bulk capabilities, but this is outside of the scope of this post :wink: .

FSCrawler settings for Workplace Search

To connect FSCrawler to Workplace Search, you need to set the following FSCrawler job settings:

name: "test"
workplace_search:
  access_token: "[ACCESS_TOKEN]"
  key: "[KEY]"

This assumes that you want to index documents from /tmp/es into a local instance of Workplace Search connected to a local instance of Elasticsearch.

Of course, you might want to run that on Elastic Cloud or on another server. In that case, change the settings to:

name: "test"
fs:
  url: /mydocuments
elasticsearch:
  username: "elastic"
  password: "PASSWORD"
  nodes:
  - cloud_id: "CLOUD_ID"
workplace_search:
  access_token: "ACCESS_TOKEN"
  key: "KEY"
  server: "https://XYZ.ent-search.ZONE.CLOUD_PROVIDER.elastic-cloud.com"

CLOUD_ID and PASSWORD are the ones you got when you deployed your cluster.

Start FSCrawler

Just run FSCrawler:

$ bin/fscrawler test -loop 1
,----------------------------------------------------------------------------------------------------.
|       ,---,.  .--.--.     ,----..                                     ,--,            2.7-SNAPSHOT |
|     ,'  .' | /  /    '.  /   /   \                                  ,--.'|                         |
|   ,---.'   ||  :  /`. / |   :     :  __  ,-.                   .---.|  | :               __  ,-.   |
|   |   |   .';  |  |--`  .   |  ;. /,' ,'/ /|                  /. ./|:  : '             ,' ,'/ /|   |
|   :   :  :  |  :  ;_    .   ; /--` '  | |' | ,--.--.       .-'-. ' ||  ' |      ,---.  '  | |' |   |
|   :   |  |-, \  \    `. ;   | ;    |  |   ,'/       \     /___/ \: |'  | |     /     \ |  |   ,'   |
|   |   :  ;/|  `----.   \|   : |    '  :  / .--.  .-. | .-'.. '   ' .|  | :    /    /  |'  :  /     |
|   |   |   .'  __ \  \  |.   | '___ |  | '   \__\/: . ./___/ \:     ''  : |__ .    ' / ||  | '      |
|   '   :  '   /  /`--'  /'   ; : .'|;  : |   ," .--.; |.   \  ' .\   |  | '.'|'   ;   /|;  : |      |
|   |   |  |  '--'.     / '   | '/  :|  , ;  /  /  ,.  | \   \   ' \ |;  :    ;'   |  / ||  , ;      |
|   |   :  \    `--'---'  |   :    /  ---'  ;  :   .'   \ \   \  |--" |  ,   / |   :    | ---'       |
|   |   | ,'               \   \ .'         |  ,     .-./  \   \ |     ---`-'   \   \  /             |
|   `----'                  `---`            `--`---'       '---"                `----'              |
+----------------------------------------------------------------------------------------------------+
|                                        You know, for Files!                                        |
|                                     Made from France with Love                                     |
|                           Source: https://github.com/dadoonet/fscrawler/                           |
|                          Documentation: https://fscrawler.readthedocs.io/                          |
`----------------------------------------------------------------------------------------------------'

And wait for a while... When the program exits, all documents have been crawled and sent to Workplace Search. If anything went wrong, check the FSCrawler logs in the logs directory.

If you check the Workplace Search admin, you should see that documents have been indexed recently.

Workplace Search defines a default schema for the content, so you will need to adapt it to make sure that dates are mapped to an actual date type and numbers to a number type. This can be done in the Schema view.

You also probably want to change the default Display Settings. In the following screenshots, we defined the "Search Results" and "Result Detail" views.

Connect Workplace Search with source documents

If you are running an HTTP server on top of your local files, you will be able not only to search for your files and highlight parts of the content, but also to open the original file from your browser. To start a simple web server using Python, you can do something like:

cd /mydocuments
python3 -m http.server --cgi 80

You can then tell FSCrawler to use this web server to serve your documents by setting workplace_search.url_prefix. By default, the URL prefix is http://127.0.0.1. It will be used to link your results in Workplace Search to the original binary document.
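
For instance, the minimal job settings from before could be extended with this prefix (values are placeholders, and the prefix here simply points at the local Python web server):

name: "test"
workplace_search:
  access_token: "[ACCESS_TOKEN]"
  key: "[KEY]"
  url_prefix: "http://127.0.0.1"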

Search!

Here comes the power!

If you search for a PDF containing, for example, the word "words", just type words pdf in the search bar. You'll see that Workplace Search has intelligently detected that pdf is actually an extension. It also shows the text we searched for highlighted within the document.

If you click on the hyperlink or "View on Local files" button, it will open the source document using your local Web server (http://127.0.0.1/test-ocr.pdf).

In conclusion, there's always a way to index your own source of documents into Workplace Search and deliver a super-powered search experience to your end users. It just takes a few clicks in Elastic Cloud and some scripts or code to use the off-the-shelf API.

Happy Christmas!
