Hi,
I have created a custom analyzer to recognize special characters in my files, such as @, -, /, etc.
Here is the custom analyzer I define when creating the index:
PUT /s3/
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "dynamic": true,
    "properties": {
      "file": {
        "properties": {
          "filename": {
            "type": "keyword",
            "store": true
          }
        }
      },
      "content": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
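For reference, this is roughly what I expect my_analyzer to do to a line of text. It is only a pure-Python approximation of the whitespace tokenizer followed by the lowercase filter, not the actual Elasticsearch implementation. The function name approx_my_analyzer is made up for illustration, and details such as the tokenizer's maximum token length are ignored:

```python
def approx_my_analyzer(text):
    """Approximate a whitespace tokenizer followed by a lowercase filter."""
    # str.split() with no arguments splits on runs of whitespace only,
    # so characters such as @, - and / stay inside the tokens.
    return [token.lower() for token in text.split()]


# Sample line built from values similar to the ones in my files
sample = "Jeh-1341 324-55-2633 SFNKJN241@outo.ogt 21/23/2243"
print(approx_my_analyzer(sample))
```

So, as far as I understand, an SSN such as 324-55-2633 should end up as a single token as long as it is surrounded by whitespace.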
Then I run fsCrawler to index all files in the folder with the custom analyzer.
But when I run a search query with a regular expression as input, only a few files show the correct highlighted content; in the other files the highlighted hit is only partial.
For example:
my input term = [0-9]{3}-[0-9]{2}-[0-9]{4}
Python code:
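from elasticsearch_dsl import Search

# client (the Elasticsearch connection) and pattern (the input term above) are defined earlier
s = Search(using=client, index="s3").query("regexp", content=pattern)
s = s.highlight('content')
for hit in s.scan():
    hit_dict = hit.to_dict()
    hit_dict['meta'] = hit.meta.to_dict()
    # Print the index name, the file name and the highlighted fragments
    print('{} {} {}'.format(hit_dict['meta']['index'], hit_dict['file']['filename'], hit_dict['meta']['highlight']['content']))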
In the output I see:
s3 biopicsbcd.csv ['Avildsen\t1\tLane Frost\tAthlete\tUnknown\t\t0\tMale\tLuke Perry\t\tJeh-1341
<em>324-55-2633</em>25#gjw\n\t84 Charing Cross']
s3 beauty212bcd.csv ['sfnkjn241@outo.ogt\n\t7.96\t35\t0\t1\t0\t1\t0\t0\t10\t4\t1-2-1827 21/23/2243\n\t11.57\t38\t0\t1\t0\t0\t1\t1\t16\t3\tJeh-1341 `<em>324</em>`']
Both files contain exactly the same SSN, but in one it is highlighted fully and in the other only partially.
Could you please tell me whether this is caused by my custom analyzer or whether it is a highlighter issue?
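In case it helps, here is a small local check I can run outside Elasticsearch. It assumes the regexp query has to match a whole term (that is my understanding of how it works), which re.fullmatch imitates on whitespace-separated, lowercased tokens. Python's regex syntax is not identical to Lucene's, and the helper name matching_tokens is made up for this sketch:

```python
import re

# Same input term as in my search query
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"


def matching_tokens(text, pattern):
    """Return the whitespace-separated tokens that match the pattern entirely."""
    compiled = re.compile(pattern)
    # Lowercase and split on whitespace to mimic my_analyzer (approximation only)
    return [tok for tok in text.lower().split() if compiled.fullmatch(tok)]


print(matching_tokens("Jeh-1341 324-55-2633 84 Charing Cross", pattern))
```

With this check, only tokens that consist of the SSN and nothing else are reported, so a token with extra characters attached to the SSN would not count as a match.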
-Lisa