Fscrawler doesn't seem to index against Includes

bilpor · October 27, 2017, 3:03pm

Here is my _settings file. Basically, I'm trying to create an index that includes only files matching the patterns on the includes line. After the crawl finishes, if I go to Kibana and query the index, nothing is returned:

{
  "name" : "fullsitedocs",
  "fs" : {
    "url" : "\\\\tst-web-20\\W$\\inetpub\\wwwroot",
    "update_rate" : "15m",
    "includes" : [ "*.htm", "*.html", "*.asp" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "10.128.128.106",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "index" : "fullsiteindex",
    "bulk_size" : 100,
    "flush_interval" : "5s",
    "username" : "elastic",
    "password" : "elastic123"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "10.128.128.106",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}

From Kibana I enter:

GET fullsiteindex/_search
{
  "query" : {
    "match" : { "file.extension": "*.htm" }
  }
}

dadoonet · October 27, 2017, 3:15pm

I'm not sure this query can work TBH:

    "match" : { "file.extension": "*.htm" }

Could you try with:

    "match" : { "file.extension": "htm" }

Instead?

But you might be hitting this though:

bilpor · October 27, 2017, 3:27pm

Even if I do

"match_all" : { }

nothing is returned; it's an empty index.

FSCrawler itself seems to run OK. When I add the --debug flag, all the messages look good.

bilpor · October 30, 2017, 8:37am

Morning dadoonet. I thought I'd try removing includes completely from my settings file and instead just have the following excludes statement:

"excludes" : [ "*.css", "*.js", "*.doc", "*.pdf", "*.xls", "*.png", "*.jpg", "*.pps", "*.asa", "*.ico", "*.idx", "*.swf", "*.mp4", "*.scc" ],

Then in Kibana:

GET fullsiteindex/_search
{
  "query" : {
    "match_all" : {  }
  }
}

This returns no results. I have pages with .htm, .html and .asp extensions, which I would have expected FSCrawler to pick up.

dadoonet · October 30, 2017, 9:00am

Did you restart FSCrawler with the --restart option?

The --trace option should tell you when files are ignored or indexed. Could you check that HTML files are not skipped?

bilpor · October 30, 2017, 10:36am

Hi dadoonet, I didn't use the --restart option, so I've just run it again with this option. When it finished, my index was no longer empty, but the excludes, as with the includes, appear not to work. I can see .jpg files, for example, in my index.

bilpor · October 30, 2017, 2:28pm

Hi dadoonet,

I think I've cracked it. I still can't get the includes to work, but with the excludes I didn't realise that matching was case sensitive. So when my exclude had just *.jpg, it included *.JPG. That was my first issue in getting the correct data into the index. My second issue was with my query in Kibana. I can't use "match", so instead I use "wildcard" in the form:

GET fullsiteindex/_search/
{
  "query" : {
    "wildcard": {
      "file.extension": {
        "value": "jpg"
      }
    }
  }
}

This seems to work for my test sample, so I'm now going to run the index against my full data set.
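The case-sensitivity behaviour described above can be illustrated with a minimal Python sketch. This is not FSCrawler's own matching code (FSCrawler is a Java application); it only assumes that exclude patterns are compared to file names as case-sensitive globs, which is what the observed behaviour suggests. The helper name `is_excluded` is hypothetical.

```python
from fnmatch import fnmatchcase


def is_excluded(filename, patterns):
    """Return True if filename matches any glob pattern, compared case-sensitively."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


excludes = ["*.jpg", "*.css"]

# A lower-case extension is caught by "*.jpg" ...
print(is_excluded("logo.jpg", excludes))
# ... but the upper-case variant slips through and would be indexed.
print(is_excluded("logo.JPG", excludes))
# Listing the upper-case pattern explicitly closes the gap.
print(is_excluded("logo.JPG", excludes + ["*.JPG"]))
```

Under that assumption, every casing that occurs on disk needs its own entry in the excludes list.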

dadoonet · October 30, 2017, 2:48pm

So when my exclude had just *.jpg, it included *.JPG.

Interesting. Wanna open an issue about this? I think I should support it.

About the query you ran, could you share a sample document?

Using wildcard is not what I'd expect. match should work. term might be better though.
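As a rough sketch of the term query suggested here, the Python snippet below builds a request body that could be pasted into Kibana after `GET fullsiteindex/_search`. The helper name `extension_term_query` is hypothetical, and lower-casing the value is an assumption about how extensions are stored in the index; it has not been confirmed in this thread. The key point is that the value is the bare extension ("htm"), not the glob from the settings file ("*.htm").

```python
import json


def extension_term_query(pattern):
    """Build a term query on file.extension from a glob such as '*.htm' or a bare 'HTM'."""
    # Keep only the text after the last dot, so "*.htm" becomes "htm".
    # Assumption: extensions are indexed in lower case.
    extension = pattern.rpartition(".")[2].lower()
    return {"query": {"term": {"file.extension": extension}}}


print(json.dumps(extension_term_query("*.htm"), indent=2))
```

A term query compares the value exactly as given, without analysis, which suits a field that holds a single short token such as a file extension.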

bilpor · October 30, 2017, 3:49pm

I'm not sure including a sample document would help. I wasn't so much interested in the content as in the type of document, hence trying to filter the types of document that go into the index.

My excludes ended up being:

"excludes" : [ "*.jpg", "*.JPG", "*.css", "*.js", "*.doc", "*.DOC", "*.pdf", "*.PDF", "*.xls", "*.png", "*.pps", "*.asa", "*.db", "*.ico", "*.idx", "*.swf", "*.mp4", "*.scc", "*.xlsx", "*.xlsm", "*.gif", "*.tif", "*.tiff", "*.ttf", "*.rfa", "*.cur", "*.psd", "*.dwt", "*.cfg", "*.fla" ],

After that, my wildcard query returned no entries for any extension in the excludes list, and returned entries for those I wanted (htm, html, txt, asp).
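Writing both casings by hand gets tedious for a long list. A small Python sketch like the following could generate the lower- and upper-case variants of each pattern before pasting them into the settings file. The helper name `with_case_variants` is hypothetical, and mixed casings such as `*.Jpg` are not covered.

```python
import json


def with_case_variants(patterns):
    """Return each pattern plus lower- and upper-case extension variants, in order, without duplicates."""
    result = []
    for pattern in patterns:
        stem, dot, extension = pattern.rpartition(".")
        variants = [pattern]
        if dot:  # only patterns that actually contain an extension get variants
            variants += [stem + dot + extension.lower(), stem + dot + extension.upper()]
        for variant in variants:
            if variant not in result:
                result.append(variant)
    return result


# Prints a JSON array that can be used as the value of "excludes".
print(json.dumps(with_case_variants(["*.jpg", "*.doc", "*.pdf"])))
```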

bilpor · October 30, 2017, 3:57pm

I have raised a new issue #458

system · November 27, 2017, 3:58pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
