How to index a file system


(Sahas Sahas) #1

I am new to Elasticsearch, and my task is to build a search engine for a file server using it. I am using ES 1.7. My high-level requirements are below.

1. Files should be searchable by name, creation date, and contents (full-text search).

2. Autocompletion should be enabled, so that partially entered filenames also produce search hits.

3. Indexing should support access control, so that a user does not see files in the search results that do not belong to him.

For content-based searching, I am thinking of using the mapper-attachments plugin. However, I am not sure what the impact on performance would be. In my case, the file server is going to be huge. How can I minimize the total space taken by the index?

Also, please suggest the best way to index a file system. Files can be updated, deleted, etc. How should I handle these scenarios and update or delete the corresponding entries in the index?

Please suggest. Thank you.


(David Pilato) #2

For now, you can use the FSRiver project, which basically does that, but note that this project is not maintained anymore, as rivers have been removed in Elasticsearch 2.0.

You can wait for logstash to do the same:

In the meantime, I would do it myself: read the filesystem, process the files with Apache Tika, and send the generated JSON documents to Elasticsearch. For that, you need to code in Java.
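A minimal sketch of that read, extract, send loop, written in Python for brevity rather than Java and using only the standard library. Several things here are assumptions and not part of any project mentioned in this thread: the function names `build_document`, `walk_documents` and `to_bulk_body`, the index name `files`, the type name `doc`, and the field names. Files are read as plain text where a real pipeline would call Apache Tika, and the last-modified time stands in for the creation date because creation time is not available portably. The output is a newline-delimited body in the shape the Elasticsearch bulk API accepts; sending it over HTTP is left out.

```python
import json
import os
from datetime import datetime, timezone


def build_document(path):
    """Build one JSON-serializable document for a file.

    Assumption: the file is plain text. A real pipeline would call a text
    extractor such as Apache Tika here instead of reading the file directly.
    """
    stat = os.stat(path)
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        content = handle.read()
    return {
        "name": os.path.basename(path),
        "path": os.path.abspath(path),
        # Last-modified time; a portable creation time is not available.
        "last_modified": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        "size": stat.st_size,
        "content": content,
    }


def walk_documents(root):
    """Yield one document per file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            yield build_document(os.path.join(dirpath, filename))


def to_bulk_body(documents, index="files", doc_type="doc"):
    """Serialize documents as a newline-delimited bulk request body.

    Each document becomes two lines: an action line, then the source line.
    The index and type names are hypothetical placeholders.
    """
    lines = []
    for doc in documents:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc["path"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "".join(line + "\n" for line in lines)
```

The returned string could be POSTed to the `_bulk` endpoint in batches. For a large file server, building a single body for the whole tree would not be practical, so the documents would need to be chunked.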

Makes sense?


(Sahas Sahas) #3

Hi David ,

Thank you for the response!

Your suggestion looks good to me. However, I am not sure how to handle re-indexing. Since files can be deleted or updated on the file server, how should I approach re-indexing? Do I need to re-index the entire system at some interval, or is there a more intelligent way? Please suggest.


(Tin Le) #4

Sounds like doing something similar to rsync, but in a logstash plugin.


(David Pilato) #5

I like this idea!


(David Pilato) #6

Well, it depends on the number of files, I guess.

The easiest thing is indeed to reindex every time but you will end up with a lot of IO, merges, ...

What I did in the FSRiver and Scrutmydocs projects was to create a hashed value of the filename and store the last modified date with it. So when I was looking at a file, I simply checked its date and, if it was unchanged, skipped the file.
Same for deletes. With the filename, I was also storing an encoded path. Then I listed the files under that path on the ES side, compared that list with the one I had just built from the filesystem, and removed all files (and subdirectories) that no longer existed.
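A rough sketch of that comparison in Python, using only the standard library. It is not the FSRiver code: the function names `doc_id`, `scan` and `plan_sync` are hypothetical, SHA-256 is an arbitrary choice of hash, and the `indexed` dictionary stands in for whatever ids and dates a query against the index would return.

```python
import hashlib
import os


def doc_id(path):
    """Derive a stable document id by hashing the file path."""
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


def scan(root):
    """Map document id -> (path, last-modified time) for files under root."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            found[doc_id(path)] = (path, os.path.getmtime(path))
    return found


def plan_sync(indexed, found):
    """Compare what the index holds with what is on disk.

    indexed: document id -> last-modified time stored in the index
             (assumed to have been fetched from the index beforehand)
    found:   the dictionary returned by scan()
    Returns (paths to index or reindex, document ids to delete).
    """
    to_index = [
        path
        for key, (path, mtime) in found.items()
        # New file, or the stored date differs from the one on disk.
        if key not in indexed or indexed[key] != mtime
    ]
    # Ids present in the index but no longer present on disk.
    to_delete = [key for key in indexed if key not in found]
    return sorted(to_index), sorted(to_delete)
```

Unchanged files fall out of both lists, so only new, modified and deleted files cause any work on the index. If the stored date loses precision in the index, the equality check would need a tolerance.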

Hope this makes sense.

You can look at the FSRiver source code. I'm not sure it works for every FS. I was using it on Windows at some point and I maintained it on MacOS.


(Sahas Sahas) #7

Thanks David ... I have more clarity now.


#8

I am very new to Elasticsearch and have a similar but simpler application where the files are stable and will not be updated. I noticed that the comments are from the August 2015 timeframe. Are there any new updates (e.g. libraries) for indexing a large set of 40,000+ HTML files that reside on a local file system? Thanks.


(David Pilato) #9

Did you try https://github.com/dadoonet/fscrawler?


#10

Thanks for your response. fscrawler does seem like exactly the right approach. But this is for a quick prototype, and as such I'd like to avoid building the tool. I noticed there is a failed status on its build. Could you provide a URL for the stable/best prebuilt version of fscrawler?
Also, I just wrote a parser for the HTML files, and they can now be stored as CSV files. Do you have recommendations (pros/cons) on using Logstash? Any feedback will be greatly appreciated, especially because this is just a quick prototype.
Again, thanks for your help.


(David Pilato) #11

As written in docs: https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.1/fscrawler-2.1.zip

The SNAPSHOT version is also OK; it just has some issues with the tests sometimes.

For CSV, yes, Logstash is a fine choice.

