When indexing a repository of files (using FSCrawler or generally), what happens under the hood?

This might seem silly, but what happens when I index files in a repository?
I just want to elaborate on my question here.
I have around 20GB of files; when I indexed them using FSCrawler, it created an index of size 56MB
(stats from GET _cat/indices).
I ran a search query to find a name, let's say "Sam".
In the results I can see the full file contents in hits > _source > content.
My question is: how can a 20GB repository turn into a 56MB index that still contains the full file contents?
What happens when I index a file or a repository of files?
Could you please explain, or point me to any documentation on this topic? I searched the official docs but couldn't find any explanation.


FSCrawler extracts text and some metadata from a file. Imagine you are crawling a 500MB video file: no text will be extracted, just a few metadata fields. That means you might index maybe 1KB out of 500MB. The same goes for PDF or Word documents that contain pictures: the pictures won't add any text, but they contribute to the original size.
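To make that concrete, here is a minimal sketch (hypothetical file names, sizes, and extracted text) of why the index ends up so small: only the text a parser can extract is stored, so the index size tracks extracted-text size, not the original file size.

```python
# Hypothetical sketch: the index stores only the text a parser can extract,
# so index size tracks extracted-text size, not original file size.
files = {
    # file name: (original size in bytes, text the parser could extract)
    "holiday.mp4": (500 * 1024 * 1024, ""),                        # video: no text at all
    "report.docx": (40 * 1024 * 1024, "Quarterly report by Sam"),  # mostly pictures
    "notes.txt":   (2 * 1024, "Meeting notes: ask Sam about Q3"),  # plain text
}

original_bytes = sum(size for size, _ in files.values())
indexed_bytes = sum(len(text.encode("utf-8")) for _, text in files.values())

print(f"original files: {original_bytes:,} bytes")
print(f"indexed text:   {indexed_bytes:,} bytes")
```

Roughly 540MB of source files yields only a few dozen bytes of indexed text here; real indexes also add some per-field and per-document overhead, but the principle is the same.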

That could explain much of what you are seeing.


What kind of metadata would FSCrawler extract?
Can I check it anywhere? I mean using an API call or something?

Is there any documentation with more details?


Things like title, tags....
See https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#generated-fields
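For example, you can pull back one indexed document and look at the generated meta, file, and path fields alongside the content field (the field names come from the documentation linked above; the index name myjob is an assumption — by default it matches your FSCrawler job name):

```
GET myjob/_search
{
  "size": 1,
  "_source": ["meta", "file", "path"],
  "query": { "match": { "content": "Sam" } }
}
```

The _source filter here hides the (potentially large) content field so you can see just the extracted metadata, such as file.filename, file.filesize, and meta.title when the parser could detect them.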

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.