Indexing Word documents

We have about 2 million Word documents on a Windows server and need a search engine for them.
We store some metadata about each document in SQL Server, such as the department: data that is not stored in the files themselves. We need to search among the files that belong to a specific department.
I tried Elasticsearch with FSCrawler. It works fine, but I couldn't find an option to add metadata to the index. I'd like to run C# code that connects to SQL Server, looks up the metadata by file name, and then have the plugin/service send all of this to ES.
So maybe I should use another plugin, maybe Logstash?

The only way to add other fields with FSCrawler for now is by using the REST API. See https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#additional-tags
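As a rough illustration, here is a minimal Python sketch of such an upload (the same idea carries over to C#). It only builds the multipart request and does not send it. Several things here are assumptions rather than facts from this thread: the URL and endpoint name (check the linked docs for your FSCrawler version), the `tags` form field with metadata placed under an `external` key as I understand the linked docs, and `lookup_metadata`, which is a hypothetical stand-in for your SQL Server query.

```python
import json
import urllib.request
import uuid

# Assumed default address and endpoint; verify against the docs for your version.
FSCRAWLER_URL = "http://127.0.0.1:8080/fscrawler/_document"


def lookup_metadata(filename):
    """Hypothetical stand-in for the SQL Server lookup keyed by file name."""
    fake_table = {"report.docx": {"department": "finance"}}
    return fake_table.get(filename, {})


def build_upload_request(url, filename, content, tags):
    """Build a multipart/form-data POST with a 'file' part and a 'tags' part."""
    boundary = uuid.uuid4().hex
    parts = []
    for field, part_name, payload, ctype in (
        ("file", filename, content, "application/octet-stream"),
        ("tags", "tags.json", json.dumps(tags).encode("utf-8"), "application/json"),
    ):
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{part_name}"\r\n'
            f"Content-Type: {ctype}\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + payload + b"\r\n")
    body = b"".join(parts) + f"--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

To actually send it, you would read the file bytes, call `build_upload_request(FSCRAWLER_URL, name, data, {"external": lookup_metadata(name)})` and pass the result to `urllib.request.urlopen`, with FSCrawler started with its REST service enabled.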

I'm planning at some point to have other outputs than elasticsearch like:

But this requires a big refactoring of the whole code and that will not happen in the short term. I'm planning that for the 3.0 version.

Another option could be using the simulate REST option: https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#simulate-upload

You can call it: it won't index anything in Elasticsearch, but will just return what would have been indexed. It's like "FSCrawler as a service", if you will.
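A small Python sketch of how that could fit into your own pipeline: add the simulate flag to the upload URL, then merge your SQL Server metadata into whatever comes back before indexing it yourself. The `simulate=true` query parameter is taken from the linked docs; the response shape is an assumption (the extracted fields are looked for under a `doc` key, falling back to the top level), so check an actual response first. The `external` field name and the sample data are only illustrative.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_simulate(url):
    """Return the upload URL with simulate=true added to its query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["simulate"] = "true"
    return urlunsplit(parts._replace(query=urlencode(query)))


def merge_metadata(simulated_response, metadata):
    """Combine the extracted document with metadata from another source.

    Assumption: the extracted fields sit under a 'doc' key, or at the top
    level if that key is absent. The input dictionaries are not modified.
    """
    doc = dict(simulated_response.get("doc", simulated_response))
    doc["external"] = dict(metadata)
    return doc
```

The merged dictionary is what you would then send to Elasticsearch with the client of your choice, which keeps the text extraction in FSCrawler and the indexing under your control.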

I hope this helps.
