Searching on multiple fields from Index created by Fscrawler


(Jasmeet) #1

Hi, I am new to Elasticsearch and have tried creating an index with FSCrawler. After creating custom analyzers, when I try to search on the fields "content.phonetic" and "content.shingle", I do not get a search hit. Can someone please guide me to where I am going wrong?
The _settings.json file used in FSCrawler is given below.

//MY CODE

{
  "settings": {
    "index.mapping.total_fields.limit": 2000,
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "custom_edge_ngram"]
        },
        "dbl_metaphone": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["dbl_metaphone"]
        },
        "shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["shingle-filter"]
        }
      },
      "filter": {
        "custom_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        },
        "dbl_metaphone": {
          "type": "phonetic",
          "encoder": "double_metaphone"
        },
        "shingle-filter": {
          "max_shingle_size": "5",
          "min_shingle_size": "2",
          "output_unigrams": "false",
          "type": "shingle"
        }
      }
    }
  },

 "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "raw_as_text": {
            "path_match": "meta.raw.*",
            "mapping": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ],
      "properties": {
        "attachment": {
          "type": "binary",
          "doc_values": false
        },
        "attributes": {
          "properties": {
            "group": {
              "type": "keyword"
            },
            "owner": {
              "type": "keyword"
            }
          }
        },
        "content": {
          "type": "text",
          "analyzer": "default",
          "search_analyzer": "standard",
          "fields": {
            "phonetic": {
              "type": "text",
              "analyzer": "dbl_metaphone"
            },
            "shingle": {
              "type": "text",
              "analyzer": "shingle"
            }
          }
        },

code continues ----------------------------------

-----ADDING DOCUMENT TO TEST INDEX--------

POST /test/_doc
{
"content": "Learning Elastic Stack 6"
}

TRYING TO SEARCH ON INDEX: no result on "content.phonetic" or "content.shingle", but a result is obtained on the field "content".

GET /test/_search
{
 "query": {
    "multi_match": {
       "query": "learning",
       "fields": ["content.phonetic"]
    }
  }
}

(David Pilato) #2

Could you provide a full recreation script as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.

Please make it as simple as possible. There is no need here for the tons of mappings you pasted to reproduce your problem.
Also, I suggest that you try the _analyze API to understand how Elasticsearch is indexing your text. That normally helps a lot.
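
As a rough illustration of the _analyze suggestion, here is a minimal Python sketch that asks Elasticsearch which tokens it produces for a given field. It assumes a node at http://localhost:9200 and the index name "test" from the first post; the helper names build_analyze_request and analyze are made up for this example.

```python
import json
import urllib.request

# Assumption: Elasticsearch listens on localhost:9200 (not stated in the thread).
ES_URL = "http://localhost:9200"


def build_analyze_request(index, field, text):
    """Build a POST /<index>/_analyze request that analyzes text as a mapped field."""
    body = json.dumps({"field": field, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(index, field, text):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(index, field, text)
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return [token["token"] for token in result["tokens"]]
```

Calling analyze("test", "content.phonetic", "learning") and comparing the result with analyze("test", "content", "learning") should show whether the sub-field really uses the analyzer you expect. If the request fails or both outputs look identical, the custom mapping was probably not applied to the index.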


(Jasmeet) #3

Thanks, I will try _analyze API as suggested.
Meanwhile, could you please guide me on the following issues:

  1. How can I identify which files have been indexed by FSCrawler? When I tried indexing some files in a folder, the number of files indexed in Elasticsearch was different from the total number of files present in the folder. The --trace option does not clearly list the non-indexed/skipped files or the files crawled.
  2. Is there a way to pause crawling in FSCrawler or restart from where the last crawl stopped?
  3. I am using FSCrawler on a Windows machine and the FSCrawler documentation is not very clear about the 'touch' command on files. Will all the files in a folder be crawled irrespective of their creation date?
  4. When I restart FSCrawler, it appears to crawl all files in the folder irrespective of whether they were previously indexed. Is there a way to tell FSCrawler not to crawl the previously indexed files and to look only for new files in the folder?
    Thanks in advance
    JS

(David Pilato) #4

How to identify which files have been indexed by fscrawler

I guess that the only way to do that is by searching in Elasticsearch, gathering all the filenames, and comparing them to what the tree command would give back.
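
The comparison could be sketched along these lines in Python. It only covers the local side: it walks the crawled folder and reports paths that are absent from a collection of indexed paths. How you collect the indexed paths from Elasticsearch (for example from a full-path field of each document) is left out and is an assumption of this sketch, as are the function names.

```python
import os


def files_on_disk(root):
    """Return the set of full paths of all files under root, recursively."""
    found = set()
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.join(directory, name))
    return found


def missing_from_index(root, indexed_paths):
    """Return the sorted paths that exist on disk but are not in indexed_paths.

    indexed_paths is assumed to hold full paths gathered from Elasticsearch,
    written the same way as the paths under root.
    """
    return sorted(files_on_disk(root) - set(indexed_paths))
```

Any path returned by missing_from_index is a file that was skipped or failed to index, which narrows down what to look for in the --trace or --debug output.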

--Trace option does not clearly list the non-indexed/skipped files or a list of files crawled.

I think that --trace or --debug print which files are meant to be indexed and whether or not we skip indexing them.

Is there a way to pause crawling in FSCrawler or restart from where the last crawl stopped?

No.

I am using FSCrawler on a Windows machine and the FSCrawler documentation is not very clear about the 'touch' command on files. Will all the files in a folder be crawled irrespective of their creation date?

Yes at the first run. Then on the next run, only files that changed will be indexed. Unless you use --restart to restart indexing all files.

When I restart FSCrawler, it appears to crawl all files in the folder irrespective of whether they were previously indexed. Is there a way to tell FSCrawler not to crawl the previously indexed files and to look only for new files in the folder?

That's what FSCrawler is supposed to be doing. But for that, the first run needs to have completed so that FSCrawler can write the last run date to disk.
If this status file does not exist, FSCrawler will start again from scratch and reindex everything.
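
To check whether a first run completed, something like the following Python sketch could be used. It rests on assumptions not stated in this thread: that the job's state lives in a _status.json file under <config dir>/<job name>/, that the config dir defaults to ~/.fscrawler, and that the file holds a "lastrun" key. Please verify these against your own installation; read_last_run is a made-up helper name.

```python
import json
from pathlib import Path


def read_last_run(job_name, config_dir=None):
    """Return the recorded last-run value for a job, or None if no status file exists."""
    # Assumed layout: <config_dir>/<job_name>/_status.json, defaulting to ~/.fscrawler.
    base = Path(config_dir) if config_dir else Path.home() / ".fscrawler"
    status_file = base / job_name / "_status.json"
    if not status_file.is_file():
        return None
    with status_file.open(encoding="utf-8") as handle:
        # Assumed key name; .get returns None if it is absent.
        return json.load(handle).get("lastrun")
```

If this returns None for your job, the next start will most likely crawl everything again.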


(Jasmeet) #5

Thanks a ton. FSCrawler has helped a lot. I hope more features are added to it.


(David Pilato) #6

Sure. PRs are welcome! :wink:

