Regex search on the content field - fscrawler

Hi!

I'm trying to use fscrawler to index my files and Kibana to run a regex search on the content field.
I'm very new to all this, but it seems that to do a regex search I first need a field of type keyword.

Am I right?
If yes, how would I achieve this?

I have tried changing the _settings.json file like so:

"content": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  }
}

and I did see the content.keyword field on the index patterns page of Kibana, but I don't see this field in the Discover results, and the regex search does not work.

I've tried some regex searches on fields that are of type keyword by default (file-name), and on these fields the regex search works as expected.
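
For reference, the kind of regex search described above can be sketched as a regexp query body. This is a minimal sketch, not fscrawler's or Kibana's own code: the field name `file.filename` is taken from the mapping shown later in this thread, while the sample value and pattern are made up for illustration. One thing worth keeping in mind is that Elasticsearch matches the pattern against the whole keyword value, which behaves roughly like `re.fullmatch` in Python (the two regex dialects are not identical).

```python
import json
import re


def regexp_query(field: str, pattern: str) -> dict:
    """Build a search request body with a regexp query on a single field."""
    return {"query": {"regexp": {field: {"value": pattern}}}}


# Hypothetical keyword value, used only to illustrate whole-value matching.
value = "invoice-2020-07.pdf"

# A pattern that covers only part of the value does not match...
partial = re.fullmatch(r"invoice", value) is not None
# ...while a pattern that covers the whole value does.
whole = re.fullmatch(r"invoice.*", value) is not None
print(partial, whole)

body = regexp_query("file.filename", "invoice.*")
print(json.dumps(body, indent=2))
```

The printed body is what would be sent to the index's `_search` endpoint. To find a substring in the middle of a value, the pattern would need `.*` on both sides.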

What am I missing?

Thank you very much for your help and time :slight_smile:

Welcome to the forum @arielwb !
You need to change the mapping for the field, not the settings for the index. You can do that with a request like this:

PUT your-index-name/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}

You will need to reindex your documents when changing a field's mapping. Here's a link to the docs to guide you.
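
A minimal sketch of the reindex step, under some assumptions: the cluster URL, the helper names and the index names passed in are illustrative, and the request shapes are those of the typeless 7.x APIs. The point it tries to show is that `_reindex` copies documents only, not mappings, so the destination index has to be created with the desired mapping before any documents are copied into it.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster without security


def content_mapping(ignore_above=None) -> dict:
    """Index-creation body giving `content` a keyword sub-field."""
    keyword = {"type": "keyword"}
    if ignore_above is not None:
        keyword["ignore_above"] = ignore_above
    return {
        "mappings": {
            "properties": {
                "content": {"type": "text", "fields": {"keyword": keyword}}
            }
        }
    }


def reindex_body(source: str, dest: str) -> dict:
    """Body for POST /_reindex copying every document from source to dest."""
    return {"source": {"index": source}, "dest": {"index": dest}}


def send(method: str, path: str, body: dict) -> dict:
    """Send one JSON request to the cluster and return the decoded response."""
    request = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def reindex_with_mapping(source: str, dest: str) -> dict:
    """Create dest with the explicit mapping first, then copy the documents."""
    send("PUT", f"/{dest}", content_mapping())
    return send("POST", "/_reindex", reindex_body(source, dest))
```

Nothing is sent until `reindex_with_mapping` is called. Any field that is not listed in the creation body would still be mapped dynamically in the destination, so in practice the full mapping of the source index would probably be copied into that body.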

Please note that the links to the documentation are for version 7.8, but you can switch to the version you need from the menu on the right.

Thank you for the reply!
That helped a lot :slight_smile:

I managed to get it working by doing it the "wrong" way, changing the settings for the index.
Yaaay!

Now I deleted everything to do it the right way (changing the mapping for the field) and I am running into another issue :frowning:

I changed the mapping for the field and reindexed, and now the field has an ignore_above property (I assume it is the default value) which is set to 256, and apparently that is too low for my use case.

I then changed the mapping again, explicitly setting ignore_above to 32766.
When I check the mapping for the old index, it shows the expected mapping:

GET /my-index/_mapping
{
  "content": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 32766
      }
    }
  }
}

But when I check the mapping of the reindexed version, ignore_above is still 256:

GET /my-new-index/_mapping
{
  "content": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

What am I missing?
What is the greatest value that I can set for this field?
Can I prevent this field from being added by default?

I believe that when I changed the settings for the index this value was not set, and I am not able to reproduce the same result by changing the mapping for the field.
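
On the question of the greatest usable value, a rough sketch of the arithmetic may help. It rests on two facts from the Elasticsearch documentation: `ignore_above` counts characters, while Lucene limits a single term to 32766 bytes, and a UTF-8 character can take up to 4 bytes. The helper names and sample strings below are made up for illustration.

```python
LUCENE_MAX_TERM_BYTES = 32766  # Lucene's limit for one indexed term, in bytes


def fits_in_term(text: str) -> bool:
    """True if the UTF-8 encoding of text stays within Lucene's term limit."""
    return len(text.encode("utf-8")) <= LUCENE_MAX_TERM_BYTES


def is_indexed_as_keyword(text: str, ignore_above: int) -> bool:
    """Approximate the ignore_above check: longer strings are skipped."""
    return len(text) <= ignore_above


# Value that stays under the byte limit even if every character needs 4 bytes.
safe_ignore_above = LUCENE_MAX_TERM_BYTES // 4

ascii_doc = "a" * 32766        # 1 byte per character
accented_doc = "\u00e9" * 32766  # 2 bytes per character in UTF-8

print(safe_ignore_above)
print(fits_in_term(ascii_doc), fits_in_term(accented_doc))
print(is_indexed_as_keyword("x" * 300, 256))
```

With `ignore_above` at 32766, the accented sample passes the character check but exceeds the byte limit, which is why 8191 is the value usually suggested for text with many non-ASCII characters. Any document whose content is longer than `ignore_above` is simply left out of content.keyword, so a regex search on that field cannot find it.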

Thanks again for your help!

Here are the responses for the _mapping call of both indexes:

The mapping generated by fscrawler after I changed the mapping (in this latest test I even tried reducing the ignore_above value to 50):

{
    "crawl": {
        "mappings": {
            "dynamic_templates": [
                {
                    "raw_as_text": {
                        "path_match": "meta.raw.*",
                        "mapping": {
                            "fields": {
                                "keyword": {
                                    "ignore_above": 256,
                                    "type": "keyword"
                                }
                            },
                            "type": "text"
                        }
                    }
                }
            ],
            "properties": {
                "attachment": {
                    "type": "binary"
                },
                "attributes": {
                    "properties": {
                        "group": {
                            "type": "keyword"
                        },
                        "owner": {
                            "type": "keyword"
                        }
                    }
                },
                "content": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 50
                        }
                    }
                },
                "file": {
                    "properties": {
                        "checksum": {
                            "type": "keyword"
                        },
                        "content_type": {
                            "type": "keyword"
                        },
                        "created": {
                            "type": "date",
                            "format": "dateOptionalTime"
                        },
                        "extension": {
                            "type": "keyword"
                        },
                        "filename": {
                            "type": "keyword",
                            "store": true
                        },
                        "filesize": {
                            "type": "long"
                        },
                        "indexed_chars": {
                            "type": "long"
                        },
                        "indexing_date": {
                            "type": "date",
                            "format": "dateOptionalTime"
                        },
                        "last_accessed": {
                            "type": "date",
                            "format": "dateOptionalTime"
                        },
                        "last_modified": {
                            "type": "date",
                            "format": "dateOptionalTime"
                        },
                        "url": {
                            "type": "keyword",
                            "index": false
                        }
                    }
                },
                "meta": {
                    "properties": {
                        "altitude": {
                            "type": "text"
                        },
                        "author": {
                            "type": "text"
                        },
                        "comments": {
                            "type": "text"
                        },
                        "contributor": {
                            "type": "text"
                        },
                        "coverage": {
                            "type": "text"
                        },
                        "created": {
                            "type": "date",
                            "format": "dateOptionalTime"
                        },
                        "creator_tool": {
                            "type": "keyword"
                        },
                        "date": {
                            "type": "date",
                            "format": "dateOptionalTime"
                        },
                        "description": {
                            "type": "text"
                        },
                        "format": {
                            "type": "text"
                        },
                        "identifier": {
                            "type": "text"
                        },
                        "keywords": {
                            "type": "text"
                        },
                        "language": {
                            "type": "keyword"
                        },
                        "latitude": {
                            "type": "text"
                        },
                        "longitude": {
                            "type": "text"
                        },
                        "metadata_date": {
                            "type": "date",
                            "format": "dateOptionalTime"
                        },
                        "modifier": {
                            "type": "text"
                        },
                        "print_date": {
                            "type": "date",
                            "format": "dateOptionalTime"
                        },
                        "publisher": {
                            "type": "text"
                        },
                        "rating": {
                            "type": "byte"
                        },
                        "relation": {
                            "type": "text"
                        },
                        "rights": {
                            "type": "text"
                        },
                        "source": {
                            "type": "text"
                        },
                        "title": {
                            "type": "text"
                        },
                        "type": {
                            "type": "text"
                        }
                    }
                },
                "path": {
                    "properties": {
                        "real": {
                            "type": "keyword",
                            "fields": {
                                "fulltext": {
                                    "type": "text"
                                },
                                "tree": {
                                    "type": "text",
                                    "analyzer": "fscrawler_path",
                                    "fielddata": true
                                }
                            }
                        },
                        "root": {
                            "type": "keyword"
                        },
                        "virtual": {
                            "type": "keyword",
                            "fields": {
                                "fulltext": {
                                    "type": "text"
                                },
                                "tree": {
                                    "type": "text",
                                    "analyzer": "fscrawler_path",
                                    "fielddata": true
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

And here is the mapping response for the reindexed index (note that all the keyword fields have a new ignore_above property):

{
"crawl_reindex": {
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "file": {
                "properties": {
                    "content_type": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "created": {
                        "type": "date"
                    },
                    "extension": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "filename": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "filesize": {
                        "type": "long"
                    },
                    "indexing_date": {
                        "type": "date"
                    },
                    "last_accessed": {
                        "type": "date"
                    },
                    "last_modified": {
                        "type": "date"
                    },
                    "url": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            },
            "meta": {
                "properties": {
                    "created": {
                        "type": "date"
                    },
                    "date": {
                        "type": "date"
                    },
                    "title": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            },
            "path": {
                "properties": {
                    "real": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "root": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "virtual": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            }
        }
    }
}
}
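
The second response looks like what dynamic mapping produces: every string field has become text with a keyword sub-field capped at 256, and the dates have lost their explicit format. That would suggest the destination index did not have an explicit mapping when the documents were copied into it. A small sketch for spotting such differences between two `properties` blocks follows; the function names are hypothetical and the sample data is a small subset of the two responses above.

```python
def flatten(properties: dict, prefix: str = "") -> dict:
    """Map dotted field paths to their types, including multi-fields."""
    types = {}
    for name, spec in properties.items():
        path = prefix + name
        if "type" in spec:
            types[path] = spec["type"]
        # Recurse into object sub-properties and into multi-fields.
        types.update(flatten(spec.get("properties", {}), path + "."))
        types.update(flatten(spec.get("fields", {}), path + "."))
    return types


def type_changes(old: dict, new: dict) -> dict:
    """Return {path: (old_type, new_type)} for fields whose type differs."""
    old_flat, new_flat = flatten(old), flatten(new)
    shared = sorted(old_flat.keys() & new_flat.keys())
    return {p: (old_flat[p], new_flat[p]) for p in shared if old_flat[p] != new_flat[p]}


keyword_256 = {"keyword": {"type": "keyword", "ignore_above": 256}}

# Subset of the "crawl" mapping shown above.
crawl = {
    "file": {"properties": {
        "filename": {"type": "keyword", "store": True},
        "filesize": {"type": "long"},
    }},
    "path": {"properties": {"root": {"type": "keyword"}}},
}

# Subset of the "crawl_reindex" mapping shown above.
crawl_reindex = {
    "file": {"properties": {
        "filename": {"type": "text", "fields": keyword_256},
        "filesize": {"type": "long"},
    }},
    "path": {"properties": {"root": {"type": "text", "fields": keyword_256}}},
}

changes = type_changes(crawl, crawl_reindex)
for path, (before, after) in changes.items():
    print(f"{path}: {before} -> {after}")
```

Fields that exist in only one of the two mappings, such as the new `.keyword` sub-fields, are ignored here; only type changes on shared paths are reported.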
