Specifying an Elasticsearch index from the FSCrawler REST API

Hi

I am using the FSCrawler REST API to index PDF, Word, and text files.
I need to specify the index name when making the HTTP request, but FSCrawler seems to use the index from the default _settings.json configuration instead.
It does not use the index name passed in the HTTP request.

_settings.json:

    {
      "ok": true,
      "version": "2.6",
      "elasticsearch": "6.8.5",
      "settings": {
        "name": "test",
        "fs": {
          "url": "D:\\Elasticsearch-data\\test",
          "update_rate": "15m",
          "excludes": [
            "*/~*"
          ],
          "json_support": false,
          "filename_as_id": false,
          "add_filesize": true,
          "remove_deleted": true,
          "add_as_inner_object": false,
          "store_source": false,
          "index_content": true,
          "attributes_support": false,
          "raw_metadata": true,
          "xml_support": false,
          "index_folders": true,
          "lang_detect": false,
          "continue_on_error": false,
          "pdf_ocr": true,
          "ocr": {
            "language": "eng"
          }
        },
        "elasticsearch": {
          "nodes": [
            {
              "url": "http://172.16.110.74:9200"
            }
          ],
          "index": "test",
          "index_folder": "test_folder",
          "bulk_size": 100,
          "flush_interval": "5s",
          "byte_size": "10mb"
        },
        "rest": {
          "url": "http://172.16.110.74:7777/fscrawler"
        }
      }
    }

HTTP request:

<cfhttp result="resultUpload" method="POST" charset="utf-8" url="http://172.16.110.74:7777/fscrawler/_upload?debug=true">
	<cfhttpparam type="formField" name="index" value="test2"/>
	<cfhttpparam type="File" name="file" file="D:\\Elasticsearch-data\\PDF-Data\\test.txt">
</cfhttp>
<cfdump var="#resultUpload#">

Response:

{
  "ok": true,
  "filename": "D:\\Elasticsearch-data\\PDF-Data\\test.txt",
  "url": "http://172.16.110.74:9200/test/_doc/da749f125747d9b7b3ef5882e401a71",
  "doc": {
    "content": "heeloo\n",
    "meta": {
      "raw": {
        "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
        "Content-Encoding": "ISO-8859-1",
        "resourceName": "D:\\Elasticsearch-data\\PDF-Data\\test.txt",
        "Content-Type": "text/plain; charset=ISO-8859-1"
      }
    },
    "file": {
      "extension": "txt",
      "content_type": "text/plain; charset=ISO-8859-1",
      "indexing_date": "2019-12-19T11:51:03.484+0000",
      "filename": "D:\\Elasticsearch-data\\PDF-Data\\test.txt"
    },
    "path": {
      "virtual": "D:\\Elasticsearch-data\\PDF-Data\\test.txt",
      "real": "D:\\Elasticsearch-data\\PDF-Data\\test.txt"
    }
  }
}

Please suggest.
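One quick way to confirm which index a document actually landed in is to parse the `url` field of the `_upload` response, which has the shape `http://host:9200/<index>/_doc/<id>`. A minimal Python sketch (standard library only; the sample response below is trimmed from the output above, and the helper name is my own):

```python
import json
from urllib.parse import urlparse

def index_from_response(body: str) -> str:
    """Extract the index name from the `url` field of an _upload response.

    The URL looks like http://host:9200/<index>/_doc/<id>, so the index
    is the first path segment.
    """
    url = json.loads(body)["url"]
    return urlparse(url).path.lstrip("/").split("/")[0]

# Response trimmed from the thread above
response = ('{"ok": true, "filename": "test.txt", '
            '"url": "http://172.16.110.74:9200/test/_doc/da749f125747d9b7b3ef5882e401a71"}')
print(index_from_response(response))  # prints "test"
```

Here it prints `test`, the default from `_settings.json`, rather than the requested `test2`.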

Thanks,

Yeah. This is not supported. Would you like to open an issue?

But as per the documentation at https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#specifying-an-elasticsearch-index , we can index documents into different indexes.

Yes, I want to open an issue.
Please provide the URL where I can log that issue.

OMG! I forgot I implemented this! :rofl:

No need to open an issue then.
Could you share the REST API call you are making then?

I have executed the following request from ColdFusion. I am passing test2 as the index name along with the file location. The file is indexed in Elasticsearch, but the index name is taken from the _settings.json file, not the one I provided.

    <cfhttp result="resultUpload" method="POST" charset="utf-8" url="http://172.16.110.74:7777/fscrawler/_upload?debug=true">
        <cfhttpparam type="formField" name="index" value="test2"/>
        <cfhttpparam type="File" name="file" file="D:\\Elasticsearch-data\\PDF-Data\\test.txt">
    </cfhttp>
    <cfdump var="#resultUpload#">

Response received:

{
  "ok": true,
  "filename": "D:\\Elasticsearch-data\\PDF-Data\\test.txt",
  "url": "http://172.16.110.74:9200/test/_doc/my-id01",
  "doc": {
    "content": "hello elastic \n",
    "meta": {
      "raw": {
        "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
        "Content-Encoding": "ISO-8859-1",
        "resourceName": "D:\\Elasticsearch-data\\PDF-Data\\test.txt",
        "Content-Type": "text/plain; charset=ISO-8859-1"
      }
    },
    "file": {
      "extension": "txt",
      "content_type": "text/plain; charset=ISO-8859-1",
      "indexing_date": "2019-12-19T13:11:19.858+0000",
      "filename": "D:\\Elasticsearch-data\\PDF-Data\\test.txt"
    },
    "path": {
      "virtual": "D:\\Elasticsearch-data\\PDF-Data\\test.txt",
      "real": "D:\\Elasticsearch-data\\PDF-Data\\test.txt"
    }
  }
}

I am not able to figure out what I am missing.

Thanks

Could you try without ColdFusion, with a curl command or from Postman or something like that?
Or could you debug ColdFusion and trace exactly what is sent to FSCrawler behind the scenes?

I have debugged and tested from both ColdFusion and Postman; both give me the same response. It seems that FSCrawler is not using the passed index name.

ColdFusion log:

"Information","ajp-bio-8014-exec-9","12/20/19","09:55:09",,"Starting HTTP request {URL='http://172.16.110.74:7777/fscrawler/_upload?debug=true', method='POST'}"
"Information","ajp-bio-8014-exec-9","12/20/19","09:55:13",,"HTTP request completed  {Status Code=200 ,Time taken=3486 ms}"

Postman response:

{
    "ok": true,
    "filename": "test.txt",
    "url": "http://172.16.110.74:9200/test/_doc/dd18bf3a8ea2a3e53e2661c7fb53534",
    "doc": {
        "content": "hello dilip\n",
        "meta": {
            "raw": {
                "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
                "Content-Encoding": "ISO-8859-1",
                "resourceName": "test.txt",
                "Content-Type": "text/plain; charset=ISO-8859-1"
            }
        },
        "file": {
            "extension": "txt",
            "content_type": "text/plain; charset=ISO-8859-1",
            "indexing_date": "2019-12-20T04:33:53.756+0000",
            "filename": "test.txt"
        },
        "path": {
            "virtual": "test.txt",
            "real": "test.txt"
        }
    }
}

The logs do not show the fields that are supposed to be passed to the FSCrawler REST API.

I checked the code and I do use the content of the index field:

And I also have integration tests about this:

I did not test locally yet but I'm surprised that this does not work.
Is there a way to debug/trace what is happening on the HTTP layer?

I have used the Fiddler tool to trace the HTTP layer; below is the raw input/output:

POST http://172.16.110.74:7777/fscrawler/_upload HTTP/1.1
Host: 172.16.110.74:7777
Connection: keep-alive
Content-Length: 386
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36
Cache-Control: no-cache
Origin: chrome-extension://fhbjgbiflinjbdggehcddcbncdddomop
Postman-Token: 2e9f2511-c525-bd94-6619-eaebd2148e85
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryVUnEQFRnKGNuQA1S
Accept: */*
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9

------WebKitFormBoundaryVUnEQFRnKGNuQA1S
Content-Disposition: form-data; name="index"

test3
------WebKitFormBoundaryVUnEQFRnKGNuQA1S
Content-Disposition: form-data; name="file"; filename="test.txt"
Content-Type: text/plain

hello dilip
------WebKitFormBoundaryVUnEQFRnKGNuQA1S
Content-Disposition: form-data; name="id"

myid-01
------WebKitFormBoundaryVUnEQFRnKGNuQA1S--

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 693

{
  "ok" : true,
  "filename" : "test.txt",
  "url" : "http://172.16.110.74:9200/test/_doc/myid-01",
  "doc" : {
    "content" : "hello dilip\n",
    "meta" : {
      "raw" : {
        "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
        "Content-Encoding" : "ISO-8859-1",
        "resourceName" : "test.txt",
        "Content-Type" : "text/plain; charset=ISO-8859-1"
      }
    },
    "file" : {
      "extension" : "txt",
      "content_type" : "text/plain; charset=ISO-8859-1",
      "indexing_date" : "2019-12-20T10:10:39.448+0000",
      "filename" : "test.txt"
    },
    "path" : {
      "virtual" : "test.txt",
      "real" : "test.txt"
    }
  }
}

FSCrawler log:

 15:39:30,169 DEBUG [f.p.e.c.f.r.RestApi] Sending document [test.txt] to elasticsearch.
15:39:34,910 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV6] Sending a bulk request of [1] requests
15:39:35,003 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV6] Executed bulk request with [1] requests
15:40:39,448 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [test.txt]
15:40:39,449 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
15:40:39,453 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
15:40:39,454 TRACE [f.p.e.c.f.t.TikaDocParser] Listing all available metadata:
15:40:39,454 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw.entrySet(), iterableWithSize(4));
15:40:39,454 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("X-Parsed-By", "org.apache.tika.parser.DefaultParser"));
15:40:39,455 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Content-Encoding", "ISO-8859-1"));
15:40:39,455 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("resourceName", "test.txt"));
15:40:39,455 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Content-Type", "text/plain; charset=ISO-8859-1"));
15:40:39,456 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation

After analyzing the trace log, I found that FSCrawler is not using the provided index name; instead it picks up the index name from the settings file.
Please suggest.
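For reference, the multipart body captured by Fiddler above can be reproduced programmatically. Here is a stdlib-only Python sketch that builds an equivalent `multipart/form-data` payload; the field names (`index`, `file`, `id`) and values follow the trace, while the helper name and boundary generation are illustrative:

```python
import io
import uuid

def build_multipart(fields, file_field, filename, file_bytes):
    """Build a multipart/form-data body like the Fiddler capture:
    plain form fields plus one file part. Returns (body, content_type)."""
    boundary = uuid.uuid4().hex  # any unique token works as a boundary
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write((f"--{boundary}\r\n"
                   f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                   f"{value}\r\n").encode())
    buf.write((f"--{boundary}\r\n"
               f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
               f"Content-Type: text/plain\r\n\r\n").encode())
    buf.write(file_bytes + b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

# Same fields as the raw request in the trace
body, ctype = build_multipart({"index": "test3", "id": "myid-01"},
                              "file", "test.txt", b"hello dilip")
```

The resulting `body` could then be POSTed to the `_upload` endpoint with `urllib.request` or any HTTP client, sending `ctype` as the `Content-Type` header.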

Here is a test I just ran on the most recent version:

Configuration (./test/rest/_settings.yaml)

---
name: "rest"
fs:
  url: "/tmp/es"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"

I start FSCrawler with:

bin/fscrawler --config_dir ./test --rest rest

Then:

echo "This is my text" > test.txt
curl -F "file=@test.txt" -F "index=my-index" "http://127.0.0.1:8080/fscrawler/_upload"

It gives:

{"ok":true,"filename":"test.txt","url":"http://127.0.0.1:9200/my-index/_doc/dd18bf3a8ea2a3e53e2661c7fb53534"}

And then:

curl http://127.0.0.1:9200/my-index/_doc/dd18bf3a8ea2a3e53e2661c7fb53534?pretty

gives:

{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "dd18bf3a8ea2a3e53e2661c7fb53534",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "content" : "This is my text\n\n",
    "meta" : { },
    "file" : {
      "extension" : "txt",
      "content_type" : "text/plain; charset=ISO-8859-1",
      "indexing_date" : "2019-12-20T15:06:03.289+0000",
      "filename" : "test.txt"
    },
    "path" : {
      "virtual" : "test.txt",
      "real" : "test.txt"
    }
  }
}

Everything sounds good on my end.

Please share your Elasticsearch and FSCrawler versions. I am using elasticsearch-6.8.5 and fscrawler-2.6.

Update to FSCrawler 2.7-SNAPSHOT please.
2.6 is super old.

Thanks.
It works with the latest version.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.