Elasticsearch fscrawler

Hello Elastic community,
I use Elasticsearch and FSCrawler to index files, but my problem is that large files are not processed by OCR when uploaded, whereas small files are processed fine. So I'm asking for your help on how to configure the FSCrawler pipeline so that it handles very large files.
This is my FSCrawler pipeline:
PUT _ingest/pipeline/blog_app_index_pipeline
{
  "description": "fscrawler pipeline for crawling blog_app files indexing",
  "processors": [
    {
      "set": {
        "field": "_id",
        "value": "files{{_source.path.virtual}}"
      }
    },
    {
      "set": {
        "field": "data_source",
        "value": "FILESYSTEM"
      }
    },
    {
      "set": {
        "field": "data_store",
        "value": "FILESYSTEM"
      }
    },
    {
      "set": {
        "field": "data_origin",
        "value": "FILESYSTEM"
      }
    },
    {
      "set": {
        "field": "id_origin",
        "value": "filename"
      }
    },
    {
      "set": {
        "field": "id_value",
        "value": "{{_source.file.filename}}"
      }
    },
    {
      "set": {
        "field": "id_fichier",
        "value": "{{_source.file.filename}}"
      }
    },
    {
      "set": {
        "field": "lib_fichier",
        "value": "{{_source.file.filename}}"
      }
    },
    {
      "set": {
        "field": "url_fichier",
        "value": "files{{_source.path.virtual}}"
      }
    },
    {
      "set": {
        "field": "data_content",
        "value": "{{_source.content}}"
      }
    },
    {
      "set": {
        "field": "created_at",
        "value": "{{_source.file.created}}"
      }
    },
    {
      "set": {
        "field": "updated_at",
        "value": "{{_source.file.last_modified}}"
      }
    },
    {
      "set": {
        "field": "deleted_at",
        "value": "null"
      }
    }
  ]
}

From Kibana to Elasticsearch
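If it helps, the pipeline can be tested against a made-up document with the _simulate API; the sample values below are only placeholders, not real data from my index:

POST _ingest/pipeline/blog_app_index_pipeline/_simulate
{
  "docs": [
    {
      "_index": "blog_app_index",
      "_id": "1",
      "_source": {
        "content": "some extracted text",
        "file": {
          "filename": "example.pdf",
          "created": "2024-01-01T00:00:00.000+00:00",
          "last_modified": "2024-01-02T00:00:00.000+00:00"
        },
        "path": {
          "virtual": "/example.pdf"
        }
      }
    }
  ]
}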

What is the error message? Is the problem with fscrawler or with the pipeline?

What are the JVM settings for FSCrawler, if any?

There is no error as such, but the upload of a large file is not processed by FSCrawler's OCR, so I cannot search the file content. I thought it was my pipeline configuration that doesn't take large files into account.
Here is the content of my settings.yaml file:

name: "blog_app_index"
fs:
url: "/my/files"
update_rate: "3m"
excludes:

  • "/~"
    json_support: false
    filename_as_id: false
    add_filesize: true
    remove_deleted: true
    add_as_inner_object: false
    store_source: false
    index_content: true
    attributes_support: false
    raw_metadata: false
    xml_support: false
    index_folders: true
    lang_detect: false
    continue_on_error: false
    ocr:
    language: "eng+fra"
    enabled: true
    pdf_strategy: "ocr_and_text"
    follow_symlinks: false
    elasticsearch:
    pipeline: "blog_app_index_pipeline"
    nodes:
  • url: "url elastic localhost port 9200"
    bulk_size: 100
    flush_interval: "5s"
    byte_size: "10mb"
What is the problem here? I don't understand why, when I upload a large file, the FSCrawler processing (OCR) doesn't work.
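For reference, one quick way to check whether a large file reached the index at all is to search for its filename from Kibana; the index name is my job name, and the filename below is only an example:

GET blog_app_index/_search
{
  "query": {
    "match": {
      "file.filename": "my_large_file.pdf"
    }
  }
}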

Did you try to increase the JVM heap? See JVM Settings — FSCrawler 2.10-SNAPSHOT documentation

Thanks. No, I didn't try, but in which FSCrawler file can I increase the JVM heap? I looked at the documentation you sent, but I can't find which file I'm supposed to edit.

It's an environment variable.

FS_JAVA_OPTS="-Xmx521m -Xms521m" bin/fscrawler

OK, thanks. I will get back to you depending on the result.
Have a good rest of the day!

Hello,
After following your instructions, I still cannot solve my large file problem with FSCrawler extraction. When I upload a file of 3 pages or fewer, the extraction is clean, but beyond that the extraction does not work. I don't know what the problem is, so I'm asking for your help.
This is the environment variable setting that I used:
FS_JAVA_OPTS="-Xms4g -Xmx4g" /opt/fscrawler-distribution-2.10-SNAPSHOT/bin/fscrawler

Could you share the file which is failing?

So I tried with your file. Here is what I did:

Got the latest build

wget https://s01.oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-distribution/2.10-SNAPSHOT/fscrawler-distribution-2.10-20240711.050438-377.zip
unzip fscrawler-distribution-2.10-20240711.050438-377.zip
cd fscrawler-distribution-2.10-20240711.050438-377
mkdir config
mkdir docs
cp /tmp/rapport_de_stage_simon.pdf docs
bin/fscrawler --config_dir ./config test

It created a file named config/test/_settings.yaml

I edited it this way (with the right /path/to/docs):

---
name: "test"
fs:
  url: "/path/to/docs"
elasticsearch:
  nodes:
  - url: "https://127.0.0.1:9200"
  ssl_verification: false
  username: "elastic"
  password: "changeme"

Then I went into the fscrawler contrib dir and ran:

docker compose up

And then:

bin/fscrawler --config_dir ./config test

This started:

12:03:56,078 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [559.3mb/9gb=6.07%], RAM [470.4mb/36gb=1.28%], Swap [0b/0b=0.0].
12:03:56,152 WARN  [f.p.e.c.f.s.Elasticsearch] username is deprecated. Use apiKey instead.
12:03:56,153 WARN  [f.p.e.c.f.s.Elasticsearch] password is deprecated. Use apiKey instead.
12:03:56,157 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
12:03:56,157 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
12:03:56,200 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
// I removed some logs here
12:03:56,416 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.14.1
12:03:56,417 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
12:03:56,451 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.14.1
12:03:56,699 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [test] for [/path/to/docs] every [15m]
12:03:57,281 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.

Then, I checked Elasticsearch and ran this in Kibana:

GET test/_search

This gave:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_id": "98732f1a1107aeed36a70d253ddda95",
        "_score": 1,
        "_source": {
          "content": """
// Skipping the content here 
""",
          "meta": {
            "author": "RApport de fin cycle | XXXX",
            "date": "2024-04-24T08:53:10.000+00:00",
            "language": "fr",
            "format": "application/pdf; version=1.7",
            "creator_tool": "Microsoft® Word 2016",
            "created": "2024-04-24T08:53:10.000+00:00"
          },
          "file": {
            "extension": "pdf",
            "content_type": "application/pdf",
            "created": "2024-07-19T09:52:27.000+00:00",
            "last_modified": "2024-07-19T09:52:27.384+00:00",
            "last_accessed": "2024-07-19T09:59:46.539+00:00",
            "indexing_date": "2024-07-19T10:03:56.722+00:00",
            "filesize": 1966947,
            "filename": "rapport_de_stage_simon.pdf",
            "url": "file:///path/to/docs/rapport_de_stage_simon.pdf"
          },
          "path": {
            "root": "9a515553e0fcda342232f65765484df4",
            "virtual": "/rapport_de_stage_simon.pdf",
            "real": "/patho/to/docs/rapport_de_stage_simon.pdf"
          }
        }
      }
    ]
  }
}

So everything worked perfectly, without the need to change the JVM settings...

To diagnose a bit more, could you please share the equivalent line of log:

12:03:56,078 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [559.3mb/9gb=6.07%], RAM [470.4mb/36gb=1.28%], Swap [0b/0b=0.0].

Thanks


ok

09:14:30,803 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [169mb/2.9gb=5.54%], RAM [3.6gb/11.9gb=31.0%], Swap [13.3gb/25.9gb=51.4%].

This is from the local server where I put FSCrawler and Elasticsearch:

12:07:42,620 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [3.7gb/3.8gb=96.14%], RAM [879.3mb/6.6gb=13.0%], Swap [3.9gb/3.9gb=100.0%].

I want to see the content of the files that FSCrawler extracts so that I can search on the file content in Kibana.

Yes, that's what FSCrawler did.
I did not want to post it here as it contains some private data, but yes, the extraction worked well.

OK, thanks.
This is the log from my FSCrawler. What could be the cause of the "1 failures" error?
23:41:06,893 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [167.9mb/2.9gb=5.51%], RAM [3.8gb/11.9gb=32.69%], Swap [13.2gb/25.9gb=51.33%].
23:41:07,434 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
23:41:07,434 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
23:41:08,123 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.1.0
23:41:08,278 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.1.0
23:41:08,316 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [panga_app_index] for [../blog_files] every [5m]
23:46:08,943 INFO [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
23:47:31,100 WARN [o.a.p.p.f.PDSimpleFont] No Unicode mapping for a4 (31) in font DSBNXV+ZapfDingbats
23:47:53,665 WARN [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] There was failures while executing bulk
java.lang.RuntimeException: 1 failures
at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerBulkResponse.buildFailureMessage(FsCrawlerBulkResponse.java:72) ~[fscrawler-framework-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchBulkResponse.buildFailureMessage(ElasticsearchBulkResponse.java:64) ~[fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerSimpleBulkProcessorListener.afterBulk(FsCrawlerSimpleBulkProcessorListener.java:43) ~[fscrawler-framework-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerAdvancedBulkProcessorListener.afterBulk(FsCrawlerAdvancedBulkProcessorListener.java:48) ~[fscrawler-framework-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerRetryBulkProcessorListener.afterBulk(FsCrawlerRetryBulkProcessorListener.java:51) ~[fscrawler-framework-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerBulkProcessor.execute(FsCrawlerBulkProcessor.java:145) ~[fscrawler-framework-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerBulkProcessor.executeWhenNeeded(FsCrawlerBulkProcessor.java:130) ~[fscrawler-framework-2.10-SNAPSHOT.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1583) ~[?:?]
23:47:53,696 WARN [f.p.e.c.f.f.b.FsCrawlerAdvancedBulkProcessorListener] Throttling is activated. Got [0] successive errors so far.
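In case it is useful, the mapping of the field that the pipeline fills with the extracted content can be checked from Kibana; the index and field names below are taken from my pipeline above:

GET blog_app_index/_mapping/field/data_content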

Hello, yesterday I sent the log of my FSCrawler where I got the error "There was failures while executing bulk". I don't know if you saw it.

Did you reproduce the same exact steps I did?

Also please use a more recent version of Elasticsearch.

Hello,
I have finally found the solution to my problem.
The problem was that I had defined the data_content field in the mapping in a way that could not hold the content of a large file.
This was my definition of the data_content field:
"data_content" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
I resolved the problem by redefining the data_content field like this:
"data_content" : {
"type" : "text"
},
but this changes how the data can be searched.
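If exact matching on data_content is still needed, an alternative I have not tested would be to keep the keyword sub-field but cap it with ignore_above, so that very long values are skipped by the keyword sub-field instead of causing failures:

"data_content" : {
  "type" : "text",
  "fields" : {
    "keyword" : {
      "type" : "keyword",
      "ignore_above" : 256
    }
  }
},

Values longer than the limit are then simply not indexed in the keyword sub-field, while the text field still makes the full content searchable.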
Thank you for your assistance!!!