When indexing a repository of files (using FSCrawler or generally), what happens under the hood?

This might seem silly, but what happens when I index files in a repository?
I just want to elaborate on my question here.
I have around 20GB of files; when I indexed them using FSCrawler, it created an index of size 56MB
(stats from GET _cat/indices).
I ran a search query to find a name, let's say "Sam".
In the results I can see the full file contents in hits > _source > content.
My question is: how can a 20GB repository turn into a 56MB index that still contains the full file contents?
What happens when I index a file or a repository of files?
Could you please explain, or point me to any documentation on this topic? I searched the official docs but couldn't find any explanation.


FSCrawler extracts text and some metadata from a file. Imagine you are crawling a 500MB video file: no text will be extracted, just a few metadata fields. That means you might index maybe 1KB out of 500MB. The same goes for PDF or Word documents that contain pictures: the pictures won't add any text, but they contribute to the original size.
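To make that concrete, here is a minimal sketch (hypothetical file names, sizes, and extracted text) of why the index ends up so small: only the text a parser can extract is stored, so the index size tracks extracted-text size, not the original file size.

```python
# Hypothetical sketch: the index stores only the text a parser can extract,
# so index size tracks extracted-text size, not original file size.
files = {
    # file name: (original size in bytes, text the parser could extract)
    "holiday.mp4": (500 * 1024 * 1024, ""),                        # video: no text at all
    "report.docx": (40 * 1024 * 1024, "Quarterly report by Sam"),  # mostly pictures
    "notes.txt":   (2 * 1024, "Meeting notes: ask Sam about Q3"),  # plain text
}

original_bytes = sum(size for size, _ in files.values())
indexed_bytes = sum(len(text.encode("utf-8")) for _, text in files.values())

print(f"original files: {original_bytes:,} bytes")
print(f"indexed text:   {indexed_bytes:,} bytes")
```

Roughly 540MB of source files yields only a few dozen bytes of indexed text here; real indexes also add some per-field and per-document overhead, but the principle is the same.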

That could explain much of what you are seeing.


What kind of metadata would FSCrawler extract?
Can I check it anywhere? I mean using an API call or something?

Is there any documentation with more details?


Things like title, tags....
See https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#generated-fields
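For example, you can pull back one indexed document and look at the generated meta, file, and path fields alongside the content field (the field names come from the documentation linked above; the index name myjob is an assumption — by default it matches your FSCrawler job name):

```
GET myjob/_search
{
  "size": 1,
  "_source": ["meta", "file", "path"],
  "query": { "match": { "content": "Sam" } }
}
```

The _source filter here hides the (potentially large) content field so you can see just the extracted metadata, such as file.filename, file.filesize, and meta.title when the parser could detect them.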

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.