Elasticsearch fscrawler

Hello Elastic community,
I use Elasticsearch and FSCrawler to crawl and index files, but large files are not processed by OCR, while small files are processed fine. Could you help me configure the FSCrawler pipeline so that it also handles very large files?
This is my FSCrawler pipeline:
PUT _ingest/pipeline/blog_app_index_pipeline
{
  "description": "FSCrawler pipeline for blog_app file indexing",
  "processors": [
    {
      "set": {
        "field": "_id",
        "value": "files{{_source.path.virtual}}"
      }
    },
    {
      "set": {
        "field": "data_source",
        "value": "FILESYSTEM"
      }
    },
    {
      "set": {
        "field": "data_store",
        "value": "FILESYSTEM"
      }
    },
    {
      "set": {
        "field": "data_origin",
        "value": "FILESYSTEM"
      }
    },
    {
      "set": {
        "field": "id_origin",
        "value": "filename"
      }
    },
    {
      "set": {
        "field": "id_value",
        "value": "{{_source.file.filename}}"
      }
    },
    {
      "set": {
        "field": "id_fichier",
        "value": "{{_source.file.filename}}"
      }
    },
    {
      "set": {
        "field": "lib_fichier",
        "value": "{{_source.file.filename}}"
      }
    },
    {
      "set": {
        "field": "url_fichier",
        "value": "files{{_source.path.virtual}}"
      }
    },
    {
      "set": {
        "field": "data_content",
        "value": "{{_source.content}}"
      }
    },
    {
      "set": {
        "field": "created_at",
        "value": "{{_source.file.created}}"
      }
    },
    {
      "set": {
        "field": "updated_at",
        "value": "{{_source.file.last_modified}}"
      }
    },
    {
      "set": {
        "field": "deleted_at",
        "value": "null"
      }
    }
  ]
}

(Sent from Kibana to Elasticsearch.)
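As a side note, the pipeline logic can be checked in isolation (independently of file size) with the simulate API; the sample document below is a minimal sketch with made-up values for the usual FSCrawler fields:

POST _ingest/pipeline/blog_app_index_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "sample extracted text",
        "path": { "virtual": "/reports/example.pdf" },
        "file": {
          "filename": "example.pdf",
          "created": "2023-01-01T00:00:00.000+00:00",
          "last_modified": "2023-01-02T00:00:00.000+00:00"
        }
      }
    }
  ]
}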

What is the error message? Is the problem with fscrawler or with the pipeline?

What are the JVM settings for FSCrawler, if any?

There is no error as such, but when I upload a large file it is not processed by FSCrawler's OCR, so I cannot search the file content. I thought it was my pipeline configuration that doesn't take large files into account.
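For context, this is the kind of search I want to run on the data_content field set by the pipeline; the query term is just a placeholder, and I assume the index name matches the job name:

GET blog_app_index/_search
{
  "query": {
    "match": {
      "data_content": "some words from the large file"
    }
  }
}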
Here is the content of my setting.yaml file:

name: "blog_app_index"
fs:
url: "/my/files"
update_rate: "3m"
excludes:

  • "/~"
    json_support: false
    filename_as_id: false
    add_filesize: true
    remove_deleted: true
    add_as_inner_object: false
    store_source: false
    index_content: true
    attributes_support: false
    raw_metadata: false
    xml_support: false
    index_folders: true
    lang_detect: false
    continue_on_error: false
    ocr:
    language: "eng+fra"
    enabled: true
    pdf_strategy: "ocr_and_text"
    follow_symlinks: false
    elasticsearch:
    pipeline: "blog_app_index_pipeline"
    nodes:
  • url: "url elastic localhost port 9200"
    bulk_size: 100
    flush_interval: "5s"
    byte_size: "10mb"
What is the problem with this? I don't understand why FSCrawler's OCR processing doesn't work when I upload a large file.
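For reference, a typical way to launch the job with this settings file; the path below is an assumption based on the default FSCrawler config layout:

# Assumed default layout: ~/.fscrawler/blog_app_index/_settings.yaml
bin/fscrawler blog_app_index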

Did you try to increase the JVM heap? See JVM Settings — FSCrawler 2.10-SNAPSHOT documentation

Thanks. No, I didn't try that yet, but in which FSCrawler file can I increase the JVM heap? I looked at the documentation you sent, but I can't find which file I should edit.

It's an environment variable.

FS_JAVA_OPTS="-Xmx521m -Xms521m" bin/fscrawler
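If the default heap is not enough for OCR on big documents, a larger value can be passed the same way; the size below is only an illustration and should be adapted to the available memory and your file sizes:

# Heap size is illustrative only; adjust to your machine (job name taken from the settings above)
FS_JAVA_OPTS="-Xmx4g -Xms4g" bin/fscrawler blog_app_index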

OK, thanks. I will try it and get back to you with the results.
Have a good rest of the day!