Only a few hit content files are showing the correct results - custom analyzer

Hi,
I have created a custom analyzer to recognize special characters in my files, such as @, -, /, etc.
Here is my custom analyzer when I am creating the index:

PUT /s3/
{
   "settings": {
     "index": {
       "number_of_shards": 1,
       "number_of_replicas": 1
      },
      "analysis": {
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "filter": [
                  "lowercase"
               ],
               "tokenizer": "whitespace"
            }
         }
      }
   },
   "mappings": {
     "dynamic": true,
      "properties": {
          "file": {
            "properties": {
              "filename": {
                "type": "keyword",
                "store": true
              }
            }
          },
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
}
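
To double-check what my_analyzer actually produces, the _analyze API can be run against this index (a quick sketch; the exact tokens depend on the whitespace around each value in the real files):

GET /s3/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Jeh-1341 324-55-2633 kashish@gmail.com"
}

With the whitespace tokenizer plus the lowercase filter this should return the tokens jeh-1341, 324-55-2633 and kashish@gmail.com, i.e. an SSN stays a single token only when it is surrounded by whitespace.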

Then I am running FSCrawler to index all the files in the folder with the custom analyzer.
But when I run a search query with a regular expression as input, only a few files show the correct highlighted content; for the other files the hit content is partial.
For example:

my input term = [0-9]{3}-[0-9]{2}-[0-9]{4}

Python code:

s = Search(using=client, index="s3").query("regexp", content=pattern)
s = s.highlight('content')
for hit in s.scan():
    hit_dict = hit.to_dict()
    hit_dict['meta'] = hit.meta.to_dict()
    print('{}  {}  {}'.format(hit_dict['meta']['index'], hit_dict['file']['filename'], hit_dict['meta']['highlight']['content']))

In the output I see:

s3  biopicsbcd.csv  ['Avildsen\t1\tLane Frost\tAthlete\tUnknown\t\t0\tMale\tLuke Perry\t\tJeh-1341
<em>324-55-2633</em>25#gjw\n\t84 Charing Cross']

s3  beauty212bcd.csv  ['sfnkjn241@outo.ogt\n\t7.96\t35\t0\t1\t0\t1\t0\t0\t10\t4\t1-2-1827 21/23/2243\n\t11.57\t38\t0\t1\t0\t0\t1\t1\t16\t3\tJeh-1341 `<em>324</em>`']

Both files have the exact same SSN, but in one file it is highlighted fully and in the other only partially.

Could you please tell me, is this because of my custom analyzer or is it a highlighter issue?

-Lisa

anyone? please!!

Please be patient in waiting for responses to your question and refrain from pinging multiple times asking for a response or opening multiple topics for the same question. This is a community forum; it may take time for someone to reply to your question. For more information please refer to the Community Code of Conduct, specifically the section "Be patient". Also, please refrain from pinging folks directly; this is a forum, and anyone who participates might be able to assist you.

If you are in need of a service with an SLA that covers response times for questions then you may want to consider talking to us about a subscription.

It's fine to answer on your own thread after 2 or 3 days (not including weekends) if you don't have an answer.

Could you provide a full recreation script as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.

Ideally, don't even think of FSCrawler. Just using standard Elasticsearch API should be enough to provide a way to reproduce and hopefully fix your problem.

Here is the script:

    import re
    from elasticsearch import Elasticsearch
    from elasticsearch_dsl import Search, Index, analyzer, Text, Document

    client = Elasticsearch('127.0.0.1', port=9200)

    def search(pattern):
        s = Search(using=client, index="s3").query("regexp",  content=pattern)
        s = s.highlight('content')
        for hit in s.scan():
           hit_dict = hit.to_dict()
           hit_dict['meta'] = hit.meta.to_dict()
           print('{}  {} {}  {}'.format(hit_dict['file']['filename'], hit_dict['meta']['highlight']['content'], hit_dict['file']['indexing_date'], hit_dict['path']['real']))
       
        return s.scan()

    if __name__ == "__main__":
        value = "123-35-5252"  # a sample SSN value that should match the pattern
        pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"  # must be a Python string (raw string used here)

        response = search(pattern)
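
For reference, the raw request this DSL code sends should be roughly equivalent to the following Dev Tools query (a sketch against the same s3 index):

GET /s3/_search
{
  "query": {
    "regexp": { "content": "[0-9]{3}-[0-9]{2}-[0-9]{4}" }
  },
  "highlight": {
    "fields": { "content": {} }
  }
}

Since a regexp query runs against the indexed tokens, with the whitespace analyzer the pattern has to match a whole token such as 123-35-5252.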

Step 1: deleted the s3 index
Step 2: assigned new settings using PUT /s3, as shown in the comment above
Step 3: ran FSCrawler and indexed all files
Step 4: ran the script

Result: only a few SSNs are getting fully highlighted, as shown above.
The regex itself is working - I also have data such as 123-234-1243 etc., and the highlight tags are attached only to values matching the SSN pattern.

I tried the following as well, but I am getting the same results:

PUT /new_s3/
{
   "settings": {
     "index": {
       "number_of_shards": 1,
       "number_of_replicas": 1
      },
      "analysis": {
        "tokenizer": {
                "MY_TOKENIZER": { 
                      "type": "char_group", 
                      "tokenize_on_chars":["whitespace", "\n","\\n","\t", ",", ";", ":", "\"", "`", "]", "[", ")", "(", "!", "?", "\\", "<", "|", "+", "=", "~", "&", "%", "^","'", "\u0027"]
                }
            },
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "filter": [
                  "lowercase"
               ],
               "tokenizer": "MY_TOKENIZER"
            }
         }
      }
   },
   "mappings": {
     "dynamic": true,
      "properties": {
          "file": {
            "properties": {
              "filename": {
                "type": "keyword",
                "store": true
              }
            }
          },
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
}
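
The same _analyze check can be done for MY_TOKENIZER (a sketch against the new index; with char_group, splits happen only on the listed characters, so -, @ and # stay inside tokens):

GET /new_s3/_analyze
{
  "analyzer": "my_analyzer",
  "text": "123-35-5252, kashish@gmail.com\t13#abc"
}

This should come back as the tokens 123-35-5252, kashish@gmail.com and 13#abc, since the comma, tab and whitespace are in tokenize_on_chars but -, @ and # are not.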

Sample file from the repository:

abds-22    Tomm-34    weir-34
 kashish@gmail.com
 bcd01942@gmail.co.in
bfajk903141@outlook.gov
123-35-5252
 12-32-3525
 13#abc abc!
 abc_ahfheq
 (183)-172-1928
 123-252-2414
 34/11/2414
1.2.2003
1-2-03
1/2/04
1/3/5443
22:17:56

-Lisa

I have created 10,000 files using the script below; these files are placed in the repository:

import os

#--- directory in which the sample files will be created
dr = "/temp/repoFiles"
#--- base file name, extension and content for the generated files
name_initial = "new_"
extension = ".pdf"
content = "abds-22    Tomm-34    weir-34 \nkashish@gmail.com \n bcd01942@gmail.co.in\nbfajk903141@outlook.gov\n123-35-5252 \n 12-32-3525 \n 13#abc abc! \n abc_ahfheq \n (183)-172-1928\n 123-252-2414 \n 34/11/2414\n1.2.2003\n1-2-03\n1/2/04\n1/3/5443\n22:17:56"
#---

# create new_00001.pdf ... new_10000.pdf, skipping any file that already exists
for n in range(1, 10001):
    file = dr + "/" + name_initial + str(n).zfill(5) + extension
    if not os.path.exists(file):
        with open(file, "wt") as out:
            out.write(content)

The script I asked for is simpler than that. As I said, you can omit FSCrawler for this. Like:

DELETE /s3
PUT /s3
{
  // Your settings
}
POST /s3/_doc
{
  // A sample doc
}
GET /s3/_search
{
  // The search you are running
}
PUT /s3/
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
     },
     "analysis": {
        "analyzer": {
           "my_analyzer": {
              "type": "custom",
              "filter": [
                 "lowercase"
              ],
              "tokenizer": "whitespace"
           }
        }
     }
  },
  "mappings": {
    "dynamic": true,
     "properties": {
         "file": {
           "properties": {
             "filename": {
               "type": "keyword",
               "store": true
             }
           }
         },
       "content": {
         "type": "text",
         "analyzer": "my_analyzer"
       }
     }
   }
}

For creating the doc:

I just uploaded a .csv file to the repository with the above content.
I tried recreating the doc with POST, but it's giving me an error; I have never created a doc using Kibana before.

POST /s3/_doc/
{
  "filename": "test_doc.csv",
 "content": "abds-22    Tomm-34    weir-34
 kashish@gmail.com
 bcd01942@gmail.co.in
bfajk903141@outlook.gov
123-35-5252
 12-32-3525
 13#abc abc!
 abc_ahfheq
 (183)-172-1928
 123-252-2414"
}

Below is the error I saw when I tried the above command in Kibana:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse",
    "caused_by": {
      "type": "json_parse_exception",
      "reason": "Illegal unquoted character ((CTRL-CHAR, code 10)): has to be escaped using backslash to be included in string value\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@6fd6aab8; line: 3, column: 44]"
    }
  },
  "status": 400
}
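
If I understand the error, the JSON string cannot contain raw line breaks; one workaround (a shortened sketch, not what I actually ran) would be to escape them as \n:

POST /s3/_doc
{
  "filename": "test_doc.csv",
  "content": "abds-22    Tomm-34    weir-34\nkashish@gmail.com\nbcd01942@gmail.co.in\nbfajk903141@outlook.gov\n123-35-5252\n12-32-3525"
}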

Anyway, once the .csv file was indexed, I ran the search command below.

GET /fs_45.79.170.172/_search/
{
    "from" : 0, "size" : 100,
    "query" : {
        "match": { "content": "123-35-5252" }
        
    },
    "highlight" : {
        "require_field_match": false,
        "fields": {
                "content" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"]}
        }
    }
}

But I found out why it is giving me only partial highlights for some files.
It's because of fragment_size in highlight.
When I set number_of_fragments=0, the whole file is displayed with the values fully highlighted.
For Example:

GET /s3/_search/
{
    "from" : 0, "size" : 100,
    "query" : {
        "match": { "content": "123-63-3525" }
    },
    "highlight" : {
        "require_field_match": false,
        "fields": {
                "content" : { "number_of_fragments" : 0,"pre_tags" : ["<em>"], "post_tags" : ["</em>"] }
        }
    }
}

In the output, the highlight content is displayed as below:

      "highlight" : {
          "content" : [
            "Taking a stroll with Shane O’Mara is a risky endeavour. \r\nThe neuroscientist is so passionate about walking, and our collective right to go for walks,\r\n that he is determined not to let the slightest unfortunate aspect of urban design break his stride. \r\n So much so, that he has a habit of darting across busy roads as the lights change. \r\n“One of life’s great horrors as you’re walking is waiting for permission to cross the street,”\r\n he tells me, when we are forced to stop for traffic – a rude interruption when, as he says, \r\n “the experience of synchrony when walking together is one of life’s great pleasures”. \r\n He knows this not only through personal experience, but from cold, hard data – \r\n walking makes us healthier, happier and brainier.\r\n\r\nWe are wandering the streets of Dublin discussing O’Mara’s new book,\r\n In Praise of Walking, a backstage tour of what happens in our brains while we perambulate. \r\n Our jaunt begins at the grand old gates of his workplace, Trinity College, \r\n and takes in the Irish famine memorial at St Stephen’s Green, the Georgian mile, \r\n the birthplace of Francis Bacon, the site of Facebook’s new European \r\n mega-HQ and the salubrious seaside dwellings of Sandymount. \r\n \r\n  \r\n lisa@gmail.com\r\n <em>123-63-3525</em>\r\n 12/12/1223\r\n 51-44-2411\r\n 13#anv\r\n John\r\n Drew\r\n josh@outlook.co.in\r\n"
          ]
        }

I tried changing fragment_size to 150, but it is still showing some partially highlighted results.

My question is: how does Elasticsearch determine which 100 or 150 characters to display? Does the fragment start from the highlighted content or end with it?
In my case the highlighted part was always at the end; maybe that's why it sometimes gets cut off, once the default 100-character limit is used up.
Not sure though!

-Lisa

Try:

POST /s3/_doc/
{
  "filename": "test_doc.csv",
 "content": """abds-22    Tomm-34    weir-34
 kashish@gmail.com
 bcd01942@gmail.co.in
bfajk903141@outlook.gov
123-35-5252
 12-32-3525
 13#abc abc!
 abc_ahfheq
 (183)-172-1928
 123-252-2414"""
}
