How to Index file system

Sahas_Sahas · August 29, 2015, 11:18am

I am new to the elasticsearch framework, and my task is to build a search engine for a file server using ES. I am using ES version 1.7. My high-level requirements are below:

1 The file should be searchable by its name, date of creation, and contents (full-text search).

2 Automatic completion should be enabled, so that partially entered filenames also produce search hits.

3 The indexing should support access control, so that a user does not see files in the search results that do not belong to them.

For content-based searching, I am thinking of using the mapper-attachment plugin. However, I am not sure what the impact on performance would be. In my case, the file server is going to be huge. How can I minimize the total space taken by the index?

Also, please suggest the best way to index a file system. Files can be updated, deleted, etc. How should I handle these scenarios and update or delete the corresponding index entries?

Please suggest. Thank you.

dadoonet · August 29, 2015, 11:38am

For now, you can use the FSRiver project, which basically does that, but note that this project is not maintained anymore, as rivers have been removed in elasticsearch 2.0.

You can wait for logstash to do the same:

What I would do in the meantime is write it myself: read the filesystem, process the files with Apache Tika, and send the generated JSON documents to elasticsearch. For that, you need to code in Java.
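The read, extract, and build-JSON steps could be sketched roughly as follows. This is a minimal illustration in Python rather than Java, chosen only for brevity. The field names (`name`, `path`, `last_modified`, `size`, `content`) and the helper names are assumptions for illustration, and `extract_text` is a plain-text stand-in for what Apache Tika would do on real documents. The HTTP call that sends each document to elasticsearch is deliberately left out.

```python
import json
import os
from datetime import datetime, timezone


def extract_text(path):
    """Stand-in for Apache Tika (assumption): read the file as UTF-8 text."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read()


def build_document(path):
    """Build one JSON-serializable document describing a file."""
    stat = os.stat(path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        "path": os.path.abspath(path),
        # Modification time is used because creation time is not portable.
        "last_modified": modified.isoformat(),
        "size": stat.st_size,
        "content": extract_text(path),
    }


def crawl(root):
    """Yield one document per file found under root, recursively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            yield build_document(os.path.join(dirpath, filename))


def to_json_lines(root):
    """Serialize every crawled document as one JSON string per file."""
    return [json.dumps(doc) for doc in crawl(root)]
```

Each string returned by `to_json_lines` is a candidate request body for indexing one file. For binary formats such as PDF or Office files, the stand-in extractor would need to be replaced by a real Tika call.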

Makes sense?

Sahas_Sahas · August 31, 2015, 5:09am

Hi David ,

Thank you for the response!

Your suggestion looks good to me. However, I am not sure how to handle re-indexing. Since files can be deleted or updated on the file server, how should I approach re-indexing? Do I need to re-index the entire system at some time interval, or is there a more intelligent way? Please suggest.

tinle · August 31, 2015, 5:23am

Sounds like doing something similar to rsync, but in a logstash plugin.

dadoonet · August 31, 2015, 7:36am

I like this idea!

dadoonet · August 31, 2015, 7:42am

Well, it depends on the number of files, I guess.

The easiest thing is indeed to reindex everything every time, but you will end up with a lot of IO, merges, ...

What I did in the FSRiver and Scrutmydocs projects was to create a hashed value of the filename and store the last modified date with it. So when I looked at a file, I simply checked its date and, if it was unchanged, skipped the file.
Same for deletes. Along with the filename, I also stored an encoded path. Then I compared the files within that path on the es side with the list I had just built from the filesystem, and removed all files (and sub dirs) that no longer existed.
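The idea above can be sketched roughly like this. It is not the FSRiver code, only a minimal Python illustration under stated assumptions: SHA-1 of the full path is used as the document ID, and the state already held in elasticsearch is assumed to be available as a plain dictionary of ID to last modified timestamp. How that dictionary is fetched from the index is left out, and all function names are hypothetical.

```python
import hashlib
import os


def doc_id(path):
    """Stable document ID: a hash of the file path (SHA-1 is an assumption)."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


def scan(root):
    """Map document ID -> (path, last modified time) for every file under root."""
    seen = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            seen[doc_id(path)] = (path, os.path.getmtime(path))
    return seen


def plan_sync(on_disk, indexed):
    """Compare a filesystem scan with what the index already holds.

    on_disk: {doc_id: (path, mtime)} as returned by scan()
    indexed: {doc_id: mtime} as previously stored in the index (assumed input)
    Returns (paths to index or reindex, document IDs to delete).
    """
    # New files have no stored date; changed files have a different one.
    # Files whose stored date matches are skipped.
    to_index = [
        path
        for key, (path, mtime) in on_disk.items()
        if indexed.get(key) != mtime
    ]
    # Anything indexed but no longer on disk should be removed.
    to_delete = [key for key in indexed if key not in on_disk]
    return sorted(to_index), sorted(to_delete)
```

Only the paths in the first list need to be parsed and sent again, which avoids the IO and merge cost of a full reindex. Comparing timestamps for exact equality assumes the stored value round-trips without loss of precision; storing whole seconds or milliseconds would be a safer choice in practice.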

Hope this makes sense.

You can look at the FSRiver source code. I'm not sure it works for every FS. I was using it on Windows at some point and I maintained it on MacOS.

Sahas_Sahas · September 1, 2015, 4:30am

Thanks David ... I have more clarity now.

fyoung · November 18, 2016, 5:47pm

I am very new to elasticsearch and have a similar but simpler application, where the files are stable and will not be updated. I noticed that the comments are from the August 2015 timeframe. Are there any new updates (e.g. libraries) to use to index a large set of 40,000+ HTML files that reside on a local file system? Thanks.

dadoonet · November 18, 2016, 6:09pm

Did you try https://github.com/dadoonet/fscrawler?

fyoung · November 21, 2016, 7:20pm

Thanks for your response. fscrawler does seem like the exact approach. But this is for a quick prototype, and as such I'd like to avoid building the tool. I noticed there is a failed status on its build. Could you provide the URL of the stable/best prebuilt version of fscrawler?
Also, I just wrote a parser for the HTML files, and now they can be stored as CSV files. Do you have recommendations (pros/cons) on using logstash? Any feedback will be greatly appreciated, especially because this is just a quick prototype.
Again thanks for your help.

dadoonet · November 21, 2016, 7:40pm

As written in docs: https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.1/fscrawler-2.1.zip

The SNAPSHOT version is also OK. It just has some issues with the tests sometimes.

For CSV, yes, Logstash works very well.
