FSCrawler doesn't seem to index against includes

Here is my _settings file. Basically, I'm trying to create an index that includes only files of the types listed on the includes line. After the crawl finishes, if I go to Kibana and query the index, nothing is returned:

{
  "name" : "fullsitedocs",
  "fs" : {
    "url" : "\\\\tst-web-20\\W$\\inetpub\\wwwroot",
    "update_rate" : "15m",
    "includes" : [ "*.htm", "*.html", "*.asp" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "10.128.128.106",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "index" : "fullsiteindex",
    "bulk_size" : 100,
    "flush_interval" : "5s",
    "username" : "elastic",
    "password" : "elastic123"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "10.128.128.106",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}

From Kibana I enter:

GET fullsiteindex/_search
{
  "query" : {
    "match" : { "file.extension": "*.htm" }
  }
}

I'm not sure this query can work TBH:

    "match" : { "file.extension": "*.htm" }

Could you try with:

    "match" : { "file.extension": "htm" }

Instead?

But you might be hitting this though:

Even if I do

"match_all" : { }

nothing is returned; it's an empty index.

FSCrawler itself seems to run OK. When I add the --debug flag, all the messages look good.

Morning dadoonet. I thought I'd try removing includes completely from my settings file and instead just have the following excludes statement:

"excludes" : [ "*.css", "*.js", "*.doc", "*.pdf", "*.xls", "*.png", "*.jpg", "*.pps", "*.asa", "*.ico", "*.idx", "*.swf", "*.mp4", "*.scc" ],

Then in Kibana:

GET fullsiteindex/_search
{
  "query" : {
    "match_all" : {  }
  }
}

This returns no results. I have pages with .htm, .html and .asp extensions, which I would have expected FSCrawler to pick up.

Did you restart FSCrawler with the --restart option?

The --trace option should tell you when files are ignored or indexed. Could you check that HTML files are not skipped?

Hi dadoonet, I didn't use the --restart option, so I've just run it again with this option. When it finished, my index was no longer empty, but excludes, like includes, appears not to work. I can see .jpg files, for example, in my index.

Hi dadoonet,

I think I've cracked it :) I still can't get the includes to work, but on the excludes I didn't realise that matching was case sensitive. So when my exclude had just *.jpg, it included *.JPG. That was my first issue in getting the correct data into the index. My second issue was with my query in Kibana. I can't use "match" but instead use "wildcard" in the form:

GET fullsiteindex/_search/
{
  "query" : {
    "wildcard": {
      "file.extension": {
        "value": "jpg"
      }
    }
  }
}

This seems to work for my test sample, so I'm now going to run the index against my full data set.

So when my exclude had just *.jpg, it included *.JPG.

Interesting. Wanna open an issue about this? I think I should support it.

About the query you ran, could you share a sample document?

Using wildcard is not what I'd expect. match should work. term might be better though.
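
To make the difference concrete, here is a minimal Python sketch that builds the request body for a term query. It assumes file.extension is mapped as a keyword field holding the bare extension (no dot, no wildcard), which I haven't verified against your mapping. extension_term_query is a hypothetical helper for illustration only, not part of FSCrawler or any Elasticsearch client.

```python
import json


def extension_term_query(ext: str) -> dict:
    """Build a search body that matches documents by exact file extension.

    Assumption: `file.extension` stores the bare extension, e.g. "htm".
    """
    # Strip glob characters and the leading dot: "*.htm" -> "htm".
    bare = ext.strip().lstrip("*").lstrip(".")
    return {"query": {"term": {"file.extension": bare}}}


if __name__ == "__main__":
    # The printed JSON can be pasted under `GET fullsiteindex/_search` in Kibana.
    print(json.dumps(extension_term_query("*.htm"), indent=2))
```

A term query compares the value as-is, so "*.htm" would only match a document whose extension is literally "*.htm". That is why the helper strips the glob prefix first.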

Not sure including a sample document would help. I wasn't so much interested in the content, but the type of document. Hence, trying to filter the type of document that goes into the index.

My excludes ended up being:

"excludes" : [ "*.jpg", "*.JPG", "*.css", "*.js", "*.doc", "*.DOC", "*.pdf", "*.PDF", "*.xls", "*.png", "*.pps", "*.asa", "*.db", "*.ico", "*.idx", "*.swf", "*.mp4", "*.scc", "*.xlsx", "*.xlsm", "*.gif", "*.tif", "*.tiff", "*.ttf", "*.rfa", "*.cur", "*.psd", "*.dwt", "*.cfg", "*.fla" ],

After that, my wildcard query returned no entries for any extension in the excludes list, and returned entries for those I wanted (htm, html, txt, asp).
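
In case it helps anyone else hitting the case-sensitivity issue, here is a small Python sketch of why *.jpg misses logo.JPG under case-sensitive glob matching, and of one way to expand a pattern list with lower- and upper-case variants. It uses Python's fnmatch.fnmatchcase purely as a stand-in for a case-sensitive matcher; FSCrawler's own matching may differ in detail. with_case_variants and is_excluded are hypothetical helpers.

```python
from fnmatch import fnmatchcase


def with_case_variants(patterns):
    """Return each pattern plus its lower- and upper-case forms, without duplicates."""
    expanded = []
    for pattern in patterns:
        for variant in (pattern, pattern.lower(), pattern.upper()):
            if variant not in expanded:
                expanded.append(variant)
    return expanded


def is_excluded(filename, patterns):
    """Case-sensitive check: does any glob pattern match the filename?"""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


if __name__ == "__main__":
    excludes = ["*.jpg", "*.pdf"]
    print(is_excluded("logo.JPG", excludes))   # not excluded: case differs
    expanded = with_case_variants(excludes)
    print(expanded)
    print(is_excluded("logo.JPG", expanded))   # excluded once *.JPG is listed
```

Note that this only covers all-lowercase and all-uppercase extensions. A mixed-case name such as logo.Jpg would still slip through, so it needs its own pattern.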

I have raised a new issue #458
