Elasticsearch fscrawler

Hello Elastic community,
I use Elasticsearch and FSCrawler to index files, but my problem is that large files are not processed by OCR when uploaded, whereas small files are processed fine. So I'm asking for your help on how to configure the FSCrawler pipeline so that it handles very large files.
This is my FSCrawler pipeline:
PUT _ingest/pipeline/blog_app_index_pipeline
{
  "description": "fscrawler pipeline for crawling blog_app files indexing",
  "processors": [
    {
      "set": {
        "field": "_id",
        "value": "files{{_source.path.virtual}}"
      }
    },
    {
      "set": {
        "field": "data_source",
        "value": "FILESYSTEM"
      }
    },
    {
      "set": {
        "field": "data_store",
        "value": "FILESYSTEM"
      }
    },
    {
      "set": {
        "field": "data_origin",
        "value": "FILESYSTEM"
      }
    },
    {
      "set": {
        "field": "id_origin",
        "value": "filename"
      }
    },
    {
      "set": {
        "field": "id_value",
        "value": "{{_source.file.filename}}"
      }
    },
    {
      "set": {
        "field": "id_fichier",
        "value": "{{_source.file.filename}}"
      }
    },
    {
      "set": {
        "field": "lib_fichier",
        "value": "{{_source.file.filename}}"
      }
    },
    {
      "set": {
        "field": "url_fichier",
        "value": "files{{_source.path.virtual}}"
      }
    },
    {
      "set": {
        "field": "data_content",
        "value": "{{_source.content}}"
      }
    },
    {
      "set": {
        "field": "created_at",
        "value": "{{_source.file.created}}"
      }
    },
    {
      "set": {
        "field": "updated_at",
        "value": "{{_source.file.last_modified}}"
      }
    },
    {
      "set": {
        "field": "deleted_at",
        "value": "null"
      }
    }
  ]
}

From Kibana to Elasticsearch
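If it helps, the pipeline can be tested against a made-up document with the _simulate API; the sample values below are only placeholders, not real data from my index:

POST _ingest/pipeline/blog_app_index_pipeline/_simulate
{
  "docs": [
    {
      "_index": "blog_app_index",
      "_id": "1",
      "_source": {
        "content": "some extracted text",
        "file": {
          "filename": "example.pdf",
          "created": "2024-01-01T00:00:00.000+00:00",
          "last_modified": "2024-01-02T00:00:00.000+00:00"
        },
        "path": {
          "virtual": "/example.pdf"
        }
      }
    }
  ]
}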

What is the error message? Is the problem with fscrawler or with the pipeline?

What are the JVM settings for FSCrawler, if any?

There is no error as such, but the upload of a large file is not processed by FSCrawler's OCR, so I cannot search the file content. I thought it was my pipeline configuration that doesn't take large files into account.
Here is the content of my settings.yaml file:

name: "blog_app_index"
fs:
url: "/my/files"
update_rate: "3m"
excludes:

  • "/~"
    json_support: false
    filename_as_id: false
    add_filesize: true
    remove_deleted: true
    add_as_inner_object: false
    store_source: false
    index_content: true
    attributes_support: false
    raw_metadata: false
    xml_support: false
    index_folders: true
    lang_detect: false
    continue_on_error: false
    ocr:
    language: "eng+fra"
    enabled: true
    pdf_strategy: "ocr_and_text"
    follow_symlinks: false
    elasticsearch:
    pipeline: "blog_app_index_pipeline"
    nodes:
  • url: "url elastic localhost port 9200"
    bulk_size: 100
    flush_interval: "5s"
    byte_size: "10mb"
What is the problem here? I don't understand why, when I upload a large file, the FSCrawler processing (OCR) doesn't work.
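For reference, one quick way to check whether a large file reached the index at all is to search for its filename from Kibana; the index name is my job name, and the filename below is only an example:

GET blog_app_index/_search
{
  "query": {
    "match": {
      "file.filename": "my_large_file.pdf"
    }
  }
}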

Did you try to increase the JVM heap? See JVM Settings — FSCrawler 2.10-SNAPSHOT documentation

Thanks. No, I didn't try, but in which FSCrawler file can I increase the JVM heap? I looked at the documentation you sent, but I can't find which file I'm supposed to edit.

It's an environment variable.

FS_JAVA_OPTS="-Xmx521m -Xms521m" bin/fscrawler

OK, thanks. I will get back to you depending on the result.
Have a good rest of the day!

Hello,
After following your instructions, I still cannot solve my large file problem with FSCrawler extraction. When I upload a file of 3 pages or fewer, the extraction is clean, but beyond that the extraction does not work. I don't know what the problem is, so I'm asking for your help.
This is the environment variable setting that I used:
FS_JAVA_OPTS="-Xms4g -Xmx4g" /opt/fscrawler-distribution-2.10-SNAPSHOT/bin/fscrawler

Could you share the file which is failing?

So I tried with your file. Here is what I did:

Got the latest build

wget https://s01.oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-distribution/2.10-SNAPSHOT/fscrawler-distribution-2.10-20240711.050438-377.zip
unzip fscrawler-distribution-2.10-20240711.050438-377.zip
cd fscrawler-distribution-2.10-20240711.050438-377
mkdir config
mkdir docs
cp /tmp/rapport_de_stage_simon.pdf docs
bin/fscrawler --config_dir ./config test

It created a file named config/test/_settings.yaml

I edited it this way (with the right /path/to/docs):

---
name: "test"
fs:
  url: "/path/to/docs"
elasticsearch:
  nodes:
  - url: "https://127.0.0.1:9200"
  ssl_verification: false
  username: "elastic"
  password: "changeme"

Then I went into the fscrawler contrib dir and ran:

docker compose up

And then:

bin/fscrawler --config_dir ./config test

This started:

12:03:56,078 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [559.3mb/9gb=6.07%], RAM [470.4mb/36gb=1.28%], Swap [0b/0b=0.0].
12:03:56,152 WARN  [f.p.e.c.f.s.Elasticsearch] username is deprecated. Use apiKey instead.
12:03:56,153 WARN  [f.p.e.c.f.s.Elasticsearch] password is deprecated. Use apiKey instead.
12:03:56,157 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
12:03:56,157 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
12:03:56,200 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
// I removed some logs here
12:03:56,416 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.14.1
12:03:56,417 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
12:03:56,451 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.14.1
12:03:56,699 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [test] for [/path/to/docs] every [15m]
12:03:57,281 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.

Then, I checked Elasticsearch and ran this in Kibana:

GET test/_search

This gave:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_id": "98732f1a1107aeed36a70d253ddda95",
        "_score": 1,
        "_source": {
          "content": """
// Skipping the content here 
""",
          "meta": {
            "author": "RApport de fin cycle | XXXX",
            "date": "2024-04-24T08:53:10.000+00:00",
            "language": "fr",
            "format": "application/pdf; version=1.7",
            "creator_tool": "Microsoft® Word 2016",
            "created": "2024-04-24T08:53:10.000+00:00"
          },
          "file": {
            "extension": "pdf",
            "content_type": "application/pdf",
            "created": "2024-07-19T09:52:27.000+00:00",
            "last_modified": "2024-07-19T09:52:27.384+00:00",
            "last_accessed": "2024-07-19T09:59:46.539+00:00",
            "indexing_date": "2024-07-19T10:03:56.722+00:00",
            "filesize": 1966947,
            "filename": "rapport_de_stage_simon.pdf",
            "url": "file:///path/to/docs/rapport_de_stage_simon.pdf"
          },
          "path": {
            "root": "9a515553e0fcda342232f65765484df4",
            "virtual": "/rapport_de_stage_simon.pdf",
            "real": "/patho/to/docs/rapport_de_stage_simon.pdf"
          }
        }
      }
    ]
  }
}

So everything worked perfectly, without the need to change the JVM settings...

To diagnose a bit more, could you please share the equivalent line of log:

12:03:56,078 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [559.3mb/9gb=6.07%], RAM [470.4mb/36gb=1.28%], Swap [0b/0b=0.0].

Thanks


ok

09:14:30,803 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [169mb/2.9gb=5.54%], RAM [3.6gb/11.9gb=31.0%], Swap [13.3gb/25.9gb=51.4%].

This is from the local server where I put FSCrawler and Elasticsearch:

12:07:42,620 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [3.7gb/3.8gb=96.14%], RAM [879.3mb/6.6gb=13.0%], Swap [3.9gb/3.9gb=100.0%].

I want to see the content of the files that FSCrawler extracts so that I can search on the file content in Kibana.

Yes, that's what FSCrawler did.
I did not want to post it here as it contains some private data, but yes, the extraction worked well.

OK, thanks.
This is the log from my FSCrawler. What could be the cause of the "1 failures" error?
23:41:06,893 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [167.9mb/2.9gb=5.51%], RAM [3.8gb/11.9gb=32.69%], Swap [13.2gb/25.9gb=51.33%].
23:41:07,434 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
23:41:07,434 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
23:41:08,123 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.1.0
23:41:08,278 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.1.0
23:41:08,316 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [panga_app_index] for [../blog_files] every [5m]
23:46:08,943 INFO [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
23:47:31,100 WARN [o.a.p.p.f.PDSimpleFont] No Unicode mapping for a4 (31) in font DSBNXV+ZapfDingbats
23:47:53,665 WARN [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] There was failures while executing bulk
java.lang.RuntimeException: 1 failures
at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerBulkResponse.buildFailureMessage(FsCrawlerBulkResponse.java:72) ~[fscrawler-framework-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchBulkResponse.buildFailureMessage(ElasticsearchBulkResponse.java:64) ~[fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerSimpleBulkProcessorListener.afterBulk(FsCrawlerSimpleBulkProcessorListener.java:43) ~[fscrawler-framework-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerAdvancedBulkProcessorListener.afterBulk(FsCrawlerAdvancedBulkProcessorListener.java:48) ~[fscrawler-framework-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerRetryBulkProcessorListener.afterBulk(FsCrawlerRetryBulkProcessorListener.java:51) ~[fscrawler-framework-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerBulkProcessor.execute(FsCrawlerBulkProcessor.java:145) ~[fscrawler-framework-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerBulkProcessor.executeWhenNeeded(FsCrawlerBulkProcessor.java:130) ~[fscrawler-framework-2.10-SNAPSHOT.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1583) ~[?:?]
23:47:53,696 WARN [f.p.e.c.f.f.b.FsCrawlerAdvancedBulkProcessorListener] Throttling is activated. Got [0] successive errors so far.
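In case it is useful, the mapping of the field that the pipeline fills with the extracted content can be checked from Kibana; the index and field names below are taken from my pipeline above:

GET blog_app_index/_mapping/field/data_content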

Hello, yesterday I sent the log of my FSCrawler where I got the error "There was failures while executing bulk". I don't know if you saw it.

Did you reproduce the same exact steps I did?

Also please use a more recent version of Elasticsearch.

Hello,
I have finally found the solution to my problem.
The problem was that I had defined the data_content field in the mapping in a way that could not hold the content of a large file.
This was my definition of the data_content field:
"data_content" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
I resolved the problem by redefining the data_content field like this:
"data_content" : {
"type" : "text"
},
but this changes how the data can be searched.
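If exact matching on data_content is still needed, an alternative I have not tested would be to keep the keyword sub-field but cap it with ignore_above, so that very long values are skipped by the keyword sub-field instead of causing failures:

"data_content" : {
  "type" : "text",
  "fields" : {
    "keyword" : {
      "type" : "keyword",
      "ignore_above" : 256
    }
  }
},

Values longer than the limit are then simply not indexed in the keyword sub-field, while the text field still makes the full content searchable.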
Thank you for your assistance!!!