Searching on multiple fields from Index created by Fscrawler

Jasmeet · October 21, 2018, 8:52am

Hi, I am new to Elasticsearch and have tried creating an index with FSCrawler. After defining custom analyzers, when I search on the fields "content.phonetic" and "content.shingle", I do not get any search hits. Can someone please point out where I am going wrong?
The _settings.json file used in FSCrawler is given below.

//MY CODE

{
  "settings": {
    "index.mapping.total_fields.limit": 2000,
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "custom_edge_ngram"]
        },
        "dbl_metaphone": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["dbl_metaphone"]
        },
        "shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["shingle-filter"]
        }
      },
      "filter": {
        "custom_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        },
        "dbl_metaphone": {
          "type": "phonetic",
          "encoder": "double_metaphone"
        },
        "shingle-filter": {
          "max_shingle_size": "5",
          "min_shingle_size": "2",
          "output_unigrams": "false",
          "type": "shingle"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "raw_as_text": {
            "path_match": "meta.raw.*",
            "mapping": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ],
      "properties": {
        "attachment": {
          "type": "binary",
          "doc_values": false
        },
        "attributes": {
          "properties": {
            "group": {
              "type": "keyword"
            },
            "owner": {
              "type": "keyword"
            }
          }
        },
        "content": {
          "type": "text",
          "index_analyzer": "default",
          "search_analyzer": "standard",
          "fields": {
            "phonetic": {
              "type": "text",
              "analyzer": "dbl_metaphone"
            },
            "shingle": {
              "type": "text",
              "analyzer": "shingle"
            }
          }
        },

code continues ----------------------------------

-----ADDING DOCUMENT TO TEST INDEX--------

POST /test/_doc
{
"content": "Learning Elastic Stack 6"
}

TRYING TO SEARCH ON INDEX -----> No results on "content.phonetic" and "content.shingle", but a result is returned for the field "content".

GET /test/_search
{
 "query": {
    "multi_match": {
       "query": "learning",
       "fields": ["content.phonetic"]
    }
  }
}

dadoonet · October 29, 2018, 6:54am

Could you provide a full recreation script as described in "About the Elasticsearch category"? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.

Please make it as simple as possible. There is no need here for the tons of mappings you pasted to reproduce your problem.
Also, I suggest that you try the _analyze API to understand how Elasticsearch is indexing your text. That usually helps a lot.
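
As a rough sketch of that suggestion, the snippet below builds two request bodies for the _analyze API: one that asks for the analyzer mapped to a field, and one that names an analyzer directly. The index name test, the field content.phonetic, the analyzer shingle and the sample text are taken from the question above; the helper function names are made up for illustration, and nothing is sent to a cluster.

```python
import json


def analyze_by_field(field, text):
    """Body for GET /<index>/_analyze that uses the analyzer mapped to `field`."""
    return {"field": field, "text": text}


def analyze_by_analyzer(analyzer, text):
    """Body for GET /<index>/_analyze that names an index analyzer directly."""
    return {"analyzer": analyzer, "text": text}


def console_request(index, body):
    """Render the request in the console style used earlier in this thread."""
    return "GET /{}/_analyze\n{}".format(index, json.dumps(body, indent=2))


if __name__ == "__main__":
    text = "Learning Elastic Stack 6"
    # Which tokens does the phonetic sub-field produce for the test document?
    print(console_request("test", analyze_by_field("content.phonetic", text)))
    # Which tokens does the custom shingle analyzer produce?
    print(console_request("test", analyze_by_analyzer("shingle", text)))
```

Running the printed requests against the index being searched shows the tokens each analyzer produces. If the response does not contain the tokens you expect, or the request fails, that is a hint that the index was not created with the mapping and analyzers you intended.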

Jasmeet · October 29, 2018, 5:07pm

Thanks, I will try the _analyze API as suggested.
Meanwhile, could you please guide me on the following issues:

How can I identify which files have been indexed by FSCrawler? When I tried indexing some files in a folder, the number of files indexed in Elasticsearch was different from the total number of files present in the folder. The --trace option does not clearly list the non-indexed/skipped files or the files crawled.
Is there a way to pause crawling in FSCrawler or restart from where the last crawl stopped?
I am using FSCrawler on a Windows machine and the FSCrawler documentation is not very clear on the 'touch' command on files. Will all the files in a folder be crawled irrespective of the date of creation?
When I restart FSCrawler, it appears to crawl all files in the folder irrespective of whether they were previously indexed. Is there a way to tell FSCrawler not to crawl the previously indexed files and look only for new files in the folder?
Thanks in advance
JS

dadoonet · October 31, 2018, 4:13pm

How to identify which files have been indexed by fscrawler

I guess the only way to do that is to search in Elasticsearch, gather all the filenames, and compare them with what the tree command gives back.
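
A minimal sketch of that comparison, assuming you have already exported the indexed file paths from Elasticsearch into a list (for example from a field holding the real path of each document, such as path.real; check your own mapping, the field name is an assumption here). The function names are hypothetical, and the directory walk stands in for tree.

```python
import os


def files_on_disk(root):
    """Return the normalized absolute path of every regular file under `root`."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.abspath(os.path.join(dirpath, name))
            # normcase lowercases paths on Windows and is a no-op elsewhere
            found.add(os.path.normcase(full))
    return found


def not_indexed(root, indexed_paths):
    """Files present under `root` whose path is missing from `indexed_paths`."""
    indexed = {os.path.normcase(os.path.abspath(p)) for p in indexed_paths}
    return sorted(files_on_disk(root) - indexed)
```

Calling not_indexed(crawled_folder, paths_from_elasticsearch) returns the files that exist on disk but have no matching document, which you can then look up in the --trace or --debug output.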

--Trace option does not clearly list the non-indexed/skipped files or a list of files crawled.

I think that --trace or --debug print which files are meant to be indexed and whether or not indexing is skipped for each of them.

Is there a way to pause crawling in FSCrawler or restart from where the last crawl stopped?

No.

github.com/dadoonet/fscrawler

Add support for Pause/Resume FSCrawler

opened 12:04PM - 18 Jan 18 UTC

konovalcev

new

Hello team. Is there any way to force fscrawler to continue his work since the …last file, that was crawled, if some exception happens. Example: we have a very big folder (about 30TB). I run crawler. Crawler has been working for 1 week, for example, about 3 millions files were crawled and after this some network error happens and crawler throws some timeout or another network exception. In this case I have to delete index from elasticsearch and run crawler again or run crawler with --restart flag (and I don't know exactly, what happens here). So I don't know, how to force crawler to continue his work since the last place, when he was stopped. It takes huge time and we are still not able to crawl this folder for at least 2 months, because in case with any network exception we have to start from the start. We use option in config file: "continue_on_error" : true, but it doesn't help in case with unhandled exception. Crawler just stops his work. Is there any way to solve our problem?

I am using FSCrawler on a Windows machine and the FSCrawler documentation is not very clear on the 'touch' command on files. Will all the files in a folder be crawled irrespective of the date of creation?

Yes, on the first run. On subsequent runs, only files that have changed will be indexed, unless you use --restart to reindex all files.

When I restart FSCrawler, it appears to crawl all files in the folder irrespective of whether they were previously indexed. Is there a way to tell FSCrawler not to crawl the previously indexed files and look only for new files in the folder?

That's what FSCrawler is supposed to do. But for that, the first run needs to have completed so that FSCrawler can write the last run date to disk.
If this status file does not exist, FSCrawler will start again from scratch and reindex everything.
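
To illustrate the idea only (this is not FSCrawler's actual code, and the function name is hypothetical), the sketch below selects files by comparing their modification time with a stored last run date. Using the modification time alone is an assumption made for the example. A missing last run date, like a missing status file, selects everything.

```python
import os


def changed_since(root, last_run):
    """Paths under `root` modified after `last_run` (seconds since the epoch).

    `last_run=None` models a missing status file: every file is returned.
    """
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if last_run is None or os.path.getmtime(path) > last_run:
                selected.append(path)
    return sorted(selected)
```

This also shows why an interrupted first run leads to a full recrawl: with no last run date recorded, there is nothing to compare the files against.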

Jasmeet · October 31, 2018, 5:23pm

Thanks a ton. FS has helped a lot. I hope more features are added to it.

dadoonet · October 31, 2018, 5:42pm

Sure. PRs are welcome!

system · November 28, 2018, 5:42pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
