[ANNOUNCEMENT] - fscrawler 2.5 released

dadoonet · August 4, 2018, 3:41pm

The FSCrawler team is pleased to announce the FSCrawler 2.5 release!

FSCrawler

FSCrawler offers a simple way to index binary files into Elasticsearch.

Usage

wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.5/fscrawler-2.5.zip

Start FSCrawler with:

bin/fscrawler job_name

FSCrawler reads its job settings from a local file (by default ~/.fscrawler/{job_name}/_settings.json).
If the file does not exist, FSCrawler will offer to create your first job.

$ bin/fscrawler job_name
18:28:58,174 WARN  [f.p.e.c.f.FsCrawler] job [job_name] does not exist
18:28:58,177 INFO  [f.p.e.c.f.FsCrawler] Do you want to create it (Y/N)?
y
18:29:05,711 INFO  [f.p.e.c.f.FsCrawler] Settings have been created in [~/.fscrawler/job_name/_settings.json]. Please review and edit before relaunch

Create a directory named /tmp/es (or c:\tmp\es on Windows), add the files you want to index to it, and start FSCrawler again:

$ bin/fscrawler job_name
18:30:34,330 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
18:30:34,332 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
18:30:34,682 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [job_name] for [/tmp/es] every [15m]
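
For reference, preparing the directory from the step above could look like the following on Linux or macOS. This is only a minimal sketch: the CRAWL_DIR variable and the hello.txt sample file are made up for illustration, and any real documents (PDFs, Office files, and so on) can be copied in instead.

```shell
#!/bin/sh
# Directory used in this walkthrough; CRAWL_DIR is a hypothetical override.
CRAWL_DIR="${CRAWL_DIR:-/tmp/es}"

# Create the directory (no error if it already exists).
mkdir -p "$CRAWL_DIR"

# hello.txt is a made-up sample document, only for illustration.
printf 'Hello FSCrawler\n' > "$CRAWL_DIR/hello.txt"

# List what is now in the directory.
ls -l "$CRAWL_DIR"
```

Once the directory contains files, relaunching bin/fscrawler job_name as shown above should pick them up, assuming the job settings still point at that directory.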

Some of the new features

#585: Add a filter by content option. Thanks to dadoonet.
#584: Ignore files bigger than X. Thanks to dadoonet.
#583: Add hocr option for Tesseract-based OCR. Thanks to dadoonet.
#582: Allow path partial matching. Thanks to dadoonet.
#580: Add support for Last Accessed date and Created date. Thanks to dadoonet.
#577: Add support for cloud id. Thanks to dadoonet.
#567: Add File Permissions to generated documents. Thanks to dadoonet.
#564: Add custom tags to documents. Thanks to gpcmol.
#563: Add support for bulk size in bytes with unit. Thanks to dadoonet.
#520: Allow setting Tesseract path to executable and data. Thanks to dadoonet.

Some of the fixed bugs

#579: Fix wrong detection of removed settings. Thanks to dadoonet.
#553: excludes doesn't appear to work with subdirectories/paths. Thanks to a344254.
#547: fscrawler throws error when using flag --loop 1. Thanks to jeanp413.
#544: Allow using store_source without indexing content. Thanks to dadoonet.
#526: Raw fields should be considered as text/keyword. Thanks to dadoonet.
#490: Missing ES pipeline shows up in fscrawler logs but REST API returns JSON with "ok": True. Thanks to shadiakiki1986.
#486: Includes and Excludes should not be case sensitive. Thanks to dadoonet.
#475: add setPipeline call when using REST. Thanks to shadiakiki1986.
#461: ES Pipeline is not working in Rest API. Thanks to suresh-nataraj.
#448: Fscrawler missing the field file.extension when indexing through Rest API. Thanks to suresh-nataraj.
#444: Tesseract not detected on Windows. Thanks to HBKarlHolzinger.
#439: ES Documents missing - Date Mapping issue in RAW field. Thanks to suresh-nataraj.
#409: Indexed document is not deleted. Thanks to faizalpribadi.
#327: Indexing Json document via bulk indexing folders also. Thanks to Spandana-Sai.

Some of the changes

#588: Update Maven plugins and Libs. Thanks to dadoonet.
#569: Update to elasticsearch 6.3.2. Thanks to dadoonet.
#554: Use _doc doc type instead of doc. Thanks to dadoonet.
#542: Update to Tika 1.18. Thanks to dadoonet.
#457: Add more info in case of bulk failures. Thanks to dadoonet.

Have fun!
-FSCrawler team

Technical_Stuffer_S · October 27, 2018, 12:57pm

@dadoonet Sir, I want to import my PDFs into an Elasticsearch instance. Can you please provide the steps for this?

dadoonet · October 27, 2018, 2:34pm

@Technical_Stuffer_S The documentation explains all that. If the documentation is unclear, please open a new question in the #elasticsearch forum describing everything you did. I'll be happy to help.
