So I tried with your file. Here is what I did:
Got the latest build
wget https://s01.oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-distribution/2.10-SNAPSHOT/fscrawler-distribution-2.10-20240711.050438-377.zip
unzip fscrawler-distribution-2.10-20240711.050438-377.zip
cd fscrawler-distribution-2.10-20240711.050438-377
mkdir config
mkdir docs
cp /tmp/rapport_de_stage_simon.pdf docs
bin/fscrawler --config_dir ./config test
It created a file named config/test/_settings.yaml, which I edited as follows (with the correct /path/to/docs):
---
name: "test"
fs:
  url: "/path/to/docs"
elasticsearch:
  nodes:
    - url: "https://127.0.0.1:9200"
  ssl_verification: false
  username: "elastic"
  password: "changeme"
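Side note: as the startup logs warn, username/password are deprecated in favor of an API key. If you go that route, the settings should look roughly like this — I'm writing the `api_key` key name from memory, so double-check it against the FSCrawler documentation:

```yaml
---
name: "test"
fs:
  url: "/path/to/docs"
elasticsearch:
  nodes:
    - url: "https://127.0.0.1:9200"
  ssl_verification: false
  # Assumption: key name may differ; an API key can be generated from Kibana
  api_key: "your-base64-encoded-api-key"
```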
Then I went into the fscrawler contrib dir and ran:
docker compose up
And then:
bin/fscrawler --config_dir ./config test
This started:
12:03:56,078 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [559.3mb/9gb=6.07%], RAM [470.4mb/36gb=1.28%], Swap [0b/0b=0.0].
12:03:56,152 WARN [f.p.e.c.f.s.Elasticsearch] username is deprecated. Use apiKey instead.
12:03:56,153 WARN [f.p.e.c.f.s.Elasticsearch] password is deprecated. Use apiKey instead.
12:03:56,157 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
12:03:56,157 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
12:03:56,200 WARN [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
// I removed some logs here
12:03:56,416 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.14.1
12:03:56,417 WARN [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
12:03:56,451 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.14.1
12:03:56,699 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [test] for [/path/to/docs] every [15m]
12:03:57,281 INFO [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
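The [15m] in the "FS crawler started" line above is the default scan interval. If I remember the FSCrawler settings correctly it can be tuned with `update_rate` in _settings.yaml (double-check the exact key name in the docs):

```yaml
---
name: "test"
fs:
  url: "/path/to/docs"
  # Assumption: re-scan the directory every 5 minutes instead of the 15m default
  update_rate: "5m"
```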
Then, I checked Elasticsearch and ran this in Kibana:
GET test/_search
This gave:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_id": "98732f1a1107aeed36a70d253ddda95",
        "_score": 1,
        "_source": {
          "content": """
          // Skipping the content here
          """,
          "meta": {
            "author": "RApport de fin cycle | XXXX",
            "date": "2024-04-24T08:53:10.000+00:00",
            "language": "fr",
            "format": "application/pdf; version=1.7",
            "creator_tool": "Microsoft® Word 2016",
            "created": "2024-04-24T08:53:10.000+00:00"
          },
          "file": {
            "extension": "pdf",
            "content_type": "application/pdf",
            "created": "2024-07-19T09:52:27.000+00:00",
            "last_modified": "2024-07-19T09:52:27.384+00:00",
            "last_accessed": "2024-07-19T09:59:46.539+00:00",
            "indexing_date": "2024-07-19T10:03:56.722+00:00",
            "filesize": 1966947,
            "filename": "rapport_de_stage_simon.pdf",
            "url": "file:///path/to/docs/rapport_de_stage_simon.pdf"
          },
          "path": {
            "root": "9a515553e0fcda342232f65765484df4",
            "virtual": "/rapport_de_stage_simon.pdf",
            "real": "/path/to/docs/rapport_de_stage_simon.pdf"
          }
        }
      }
    ]
  }
}
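In case it helps, the response is also easy to consume from a script. Here is a minimal sketch, using a trimmed copy of the hit above (content and most metadata elided), that pulls out the file metadata:

```python
import json

# A trimmed copy of the search response shown above.
response = json.loads("""
{
  "hits": {
    "total": {"value": 1, "relation": "eq"},
    "hits": [
      {
        "_index": "test",
        "_id": "98732f1a1107aeed36a70d253ddda95",
        "_source": {
          "file": {
            "filename": "rapport_de_stage_simon.pdf",
            "content_type": "application/pdf",
            "filesize": 1966947
          },
          "meta": {"language": "fr"}
        }
      }
    ]
  }
}
""")

# Walk the hits and print one line per indexed document.
for hit in response["hits"]["hits"]:
    info = hit["_source"]["file"]
    print(f'{info["filename"]} ({info["content_type"]}, {info["filesize"]} bytes)')
# → rapport_de_stage_simon.pdf (application/pdf, 1966947 bytes)
```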
So everything worked perfectly, without the need to change the JVM settings...
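If the heap ever did need bumping, I believe the launcher script reads extra JVM flags from the FS_JAVA_OPTS environment variable — treat the variable name as an assumption and verify it against the FSCrawler docs or the bin script:

```shell
# Assumption: bin/fscrawler picks up JVM flags from FS_JAVA_OPTS
FS_JAVA_OPTS="-Xms2g -Xmx2g" bin/fscrawler --config_dir ./config test
```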
To diagnose a bit further, could you please share the equivalent log line from your run:
12:03:56,078 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [559.3mb/9gb=6.07%], RAM [470.4mb/36gb=1.28%], Swap [0b/0b=0.0].
Thanks