Indexing Word documents

Tali · March 13, 2019, 2:55pm

We have about 2 million Word documents on a Windows server and need a search engine for them.
We store some metadata about each document in SQL Server, such as the department. This data is not stored in the files themselves. We need to search among the files that belong to a specific department.
I tried Elasticsearch with FSCrawler. It works fine, but I didn't find an option to add metadata to the index. I'd like to run C# code that connects to SQL Server, finds the metadata by file name, and then have the plugin/service send all of it to ES.
So maybe I should use another plugin, maybe Logstash?

dadoonet · March 13, 2019, 6:19pm

The only way to add other fields with FSCrawler for now is by using the REST API. See https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#additional-tags
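To make that concrete, here is a rough sketch of an upload that carries extra tags. It is in Python for brevity, but the same multipart request can be built from C#. Assumptions on my side: the REST service listens on the default `http://127.0.0.1:8080/fscrawler/_upload`, the form fields are named `file` and `tags`, and the tags JSON nests custom fields under an `external` object as in the linked docs. Please check these against the docs for your FSCrawler version. The `metadata` dict stands in for whatever your SQL Server lookup returns, and `upload_with_tags` is a hypothetical helper name.

```python
import json
import uuid
import urllib.request
from pathlib import Path

# Assumed default address of the FSCrawler REST service; adjust to your setup.
FSCRAWLER_UPLOAD_URL = "http://127.0.0.1:8080/fscrawler/_upload"


def build_tags(metadata: dict) -> bytes:
    """Wrap the SQL metadata in an 'external' object (assumed tags layout)."""
    return json.dumps({"external": metadata}).encode("utf-8")


def encode_multipart(parts: dict, boundary: str) -> bytes:
    """Encode {field: (filename, content_bytes)} as multipart/form-data."""
    chunks = []
    for field, (filename, content) in parts.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        chunks.extend([header, content, b"\r\n"])
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks)


def upload_with_tags(path: Path, metadata: dict,
                     url: str = FSCRAWLER_UPLOAD_URL) -> str:
    """POST the document plus its tags to FSCrawler and return the raw response."""
    boundary = uuid.uuid4().hex
    body = encode_multipart(
        {
            "file": (path.name, path.read_bytes()),
            "tags": ("tags.txt", build_tags(metadata)),
        },
        boundary,
    )
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

You would call something like `upload_with_tags(Path("report.docx"), {"department": "HR"})` once per file, after looking the file name up in SQL Server.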

I'm planning at some point to have other outputs than elasticsearch like:

But this requires a big refactoring of the whole code and that will not happen in the short term. I'm planning that for the 3.0 version.

Another option could be using the simulate REST option: https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#simulate-upload

You can call it. It won't index anything in elasticsearch; it will just return what would have been indexed. It's like "FSCrawler as a service" if you will.
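With the simulate option you would then merge your own fields into the returned document and index it yourself. A minimal sketch of that second half, again in Python, under these assumptions: the upload was called with `?debug=true&simulate=true`, the response JSON carries the extracted document under a `doc` key (if not, the whole response is used), and the metadata goes into an `external` field. The function names, the index name and the document id are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def enrich(simulated_response: str, metadata: dict) -> dict:
    """Merge SQL metadata into the document returned by a simulated upload."""
    payload = json.loads(simulated_response)
    # Assumption: the extracted document sits under "doc"; fall back to the payload.
    doc = dict(payload.get("doc", payload))
    doc["external"] = metadata
    return doc


def build_index_request(doc: dict, es_url: str, index: str,
                        doc_id: str) -> urllib.request.Request:
    """Build a PUT <es_url>/<index>/_doc/<id> request for Elasticsearch."""
    url = f"{es_url.rstrip('/')}/{index}/_doc/{urllib.parse.quote(doc_id, safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Sending it is then just `urllib.request.urlopen(build_index_request(doc, "http://localhost:9200", "docs", "report.docx"))`. The trade-off compared to the tags approach is that you own the indexing step (ids, retries, bulk requests) yourself.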

I hope this helps.

system · April 10, 2019, 6:19pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
