Index DB content and linked filesystem content


#1

Hi,

I need to index a database and files in Elasticsearch. The database holds information about the file system: the file path and file name are stored in it along with other fields.

I need to index the database fields, take the file name from the DB, extract the content of that file (say, Word or PDF), and index it in Elasticsearch using mapper-attachments/ingest-attachment.

I tried Logstash and FSCrawler. Both work fine, but they index the records separately.

Is there any way to index the database content and the file system content as a single record in Elasticsearch, since the two are linked?

If there is no out-of-the-box component, how can I index the combined (DB + file system) records in Elasticsearch?

Can I use Apache Tika to extract the content and index it directly, or is there a better approach?


(David Pilato) #2

FSCrawler has a simulate mode: https://github.com/dadoonet/fscrawler#simulate-upload

This means you can start FSCrawler as a REST service.

So you could do the following in your application:

  • Fetch a record from the DB
  • Call curl -F "file=@/path/to/your/file" "http://127.0.0.1:8080/fscrawler/_upload?debug=true&simulate=true"
  • Get the response back
  • Aggregate both sets of data in your app:
    • The one coming from your DB
    • The one coming from FSCrawler
  • Build a new JSON document from that
  • Send that JSON to Elasticsearch
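
As a rough illustration of the steps above, here is a minimal Python sketch that uses only the standard library. Several things in it are assumptions rather than anything confirmed in this thread: the helper names are made up, the extracted document is assumed to come back under a "doc" key of the FSCrawler response (the code falls back to the whole response otherwise), the Elasticsearch URL and the index name are placeholders, and the typeless _doc endpoint assumes a recent Elasticsearch version.

```python
import json
import os
import urllib.request
import uuid

# The upload URL is the one from the curl command above.
FSCRAWLER_UPLOAD = "http://127.0.0.1:8080/fscrawler/_upload?debug=true&simulate=true"
ES_URL = "http://127.0.0.1:9200"  # placeholder, adjust to your cluster


def encode_multipart(filename, payload, boundary):
    """Build a multipart/form-data body with a single form field named 'file'."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail


def extract_with_fscrawler(path, endpoint=FSCRAWLER_UPLOAD):
    """Upload one file to FSCrawler in simulate mode and return the parsed JSON reply."""
    boundary = uuid.uuid4().hex
    with open(path, "rb") as handle:
        body = encode_multipart(os.path.basename(path), handle.read(), boundary)
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def build_document(db_row, fscrawler_response):
    """Merge one DB row with the extracted file data into a single document."""
    doc = dict(db_row)  # copy, so the caller's row is not modified
    # Assumption: the extracted document sits under "doc"; otherwise keep the full reply.
    # It is nested under "file" so its keys cannot clash with the DB columns.
    doc["file"] = fscrawler_response.get("doc", fscrawler_response)
    return doc


def index_document(index, doc_id, doc, es_url=ES_URL):
    """Send the merged document to Elasticsearch under the given id."""
    request = urllib.request.Request(
        f"{es_url}/{index}/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For each row fetched from the DB, the application would then call extract_with_fscrawler(row["file_path"]), pass the row and the reply to build_document, and hand the result to index_document. The "file_path" column name is hypothetical; use whichever column stores the path in your schema.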

Would that work for you?


#3

Sure, I will give it a try.

